2001-04-22 16:32:47

by David L. Parsley

[permalink] [raw]
Subject: hundreds of mount --bind mountpoints?

Hi,

I'm still working on a packaging system for diskless (quasi-embedded)
devices. The root filesystem is all tmpfs, and I attach packages inside
it. Since symlinks in a tmpfs filesystem cost 4k each (ouch!), I'm
considering using mount --bind for everything. This appears to use very
little memory, but I'm wondering if I'll run into problems when I start
having many hundreds of bind mountings. Any feel for this?

regards,
David

--
David L. Parsley
Roanoke College Network Administrator


2001-04-22 16:41:38

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Sun, 22 Apr 2001, David L. Parsley wrote:

> Hi,
>
> I'm still working on a packaging system for diskless (quasi-embedded)
> devices. The root filesystem is all tmpfs, and I attach packages inside
> it. Since symlinks in a tmpfs filesystem cost 4k each (ouch!), I'm
> considering using mount --bind for everything. This appears to use very
> little memory, but I'm wondering if I'll run into problems when I start
> having many hundreds of bind mountings. Any feel for this?

Memory use is sizeof(struct vfsmount) per binding. In principle, you can get
into trouble when the size of /proc/mounts gets past 4KB - you'll get only the
first 4 (actually 3, IIRC) kilobytes, so stuff that relies on the contents
of that file may get unhappy. It's fixable, though.
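
For David's setup a rough userspace sanity check is easy enough; here is a
minimal sketch (assuming the ~4KB figure above) that just measures how much
of /proc/mounts can actually be read back:

	/* sketch: warn when /proc/mounts approaches the old ~4KB procfs limit */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/mounts", "r");
		long total = 0;

		if (!f) {
			perror("/proc/mounts");
			return 1;
		}
		while (fgetc(f) != EOF)
			total++;
		fclose(f);

		printf("/proc/mounts is %ld bytes\n", total);
		if (total > 3 * 1024)
			printf("warning: close to the old procfs truncation limit\n");
		return 0;
	}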

2001-04-23 12:00:39

by Christoph Rohland

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Hi David,

On Sun, 22 Apr 2001, David L. Parsley wrote:
> I'm still working on a packaging system for diskless
> (quasi-embedded) devices. The root filesystem is all tmpfs, and I
> attach packages inside it. Since symlinks in a tmpfs filesystem
> cost 4k each (ouch!), I'm considering using mount --bind for
> everything.

What about fixing tmpfs instead?

Greetings
Christoph


2001-04-23 13:19:03

by Ingo Oeser

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

On Mon, Apr 23, 2001 at 01:43:27PM +0200, Christoph Rohland wrote:
> On Sun, 22 Apr 2001, David L. Parsley wrote:
> > attach packages inside it. Since symlinks in a tmpfs filesystem
> > cost 4k each (ouch!), I'm considering using mount --bind for
> > everything.
>
> What about fixing tmpfs instead?

The question is: How? If you do it like ramfs, you cannot swap
these symlinks and this is effectively a mlock(symlink) operation
allowed for normal users. -> BAD!

One idea is to only use a page if the entry is about to be pushed into
swap, thus wasting only swap, not memory (we have more of the former).

But allocating a page on memory pressure is also not a bright
idea.

OTOH we could force this entry to swap immediately, after we have
copied it from the dentry. So we can do a GFP_ATOMIC allocation, not
do too much harm to memory pressure, and only make the IO a bit
stormier.

I think there are a lot of races here which I don't see right now.

So please don't beat me too much if this is a completely stupid
idea, ok? ;-)


Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-23 13:55:56

by David L. Parsley

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Christoph Rohland wrote:
>
> Hi David,
>
> On Sun, 22 Apr 2001, David L. Parsley wrote:
> > I'm still working on a packaging system for diskless
> > (quasi-embedded) devices. The root filesystem is all tmpfs, and I
> > attach packages inside it. Since symlinks in a tmpfs filesystem
> > cost 4k each (ouch!), I'm considering using mount --bind for
> > everything.
>
> What about fixing tmpfs instead?

That would be great - are you volunteering? ;-) Seriously - I might be
able to look at what ramfs does and port that to tmpfs for my needs, but
that's about the extent of my kernel hacking skills. For now, mount
--bind looks like it'll work just fine. If somebody wants to fix tmpfs,
I'll be happy to test patches; it'll just change a couple of lines in my
package loading logic (mount --bind x y -> ln -s x y).

What I'm not sure of is which solution is actually 'better' - I'm
guessing that performance-wise, neither will make a noticeable
difference, so I guess memory usage would be the deciding factor. If I
can get a lot closer to the size of a symlink (10-20 bytes) that would
be best. The issue with /proc/mounts really shouldn't hurt anything - I
could almost get by without mounting /proc anyway, it's mainly a
convenience.

regards,
David

--
David L. Parsley
Network Administrator
Roanoke College

2001-04-23 14:14:50

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Mon, 23 Apr 2001, David L. Parsley wrote:

> What I'm not sure of is which solution is actually 'better' - I'm
> guessing that performance-wise, neither will make a noticable
> difference, so I guess memory usage would be the deciding factor. If I

Bindings are faster on lookup. For obvious reasons - in case of symlinks
you do name resolution every time you traverse the link; in case of
bindings it is done when you create them.

2001-04-23 15:11:52

by Christoph Rohland

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Hi Ingo,

On Mon, 23 Apr 2001, Ingo Oeser wrote:
> On Mon, Apr 23, 2001 at 01:43:27PM +0200, Christoph Rohland wrote:
>> On Sun, 22 Apr 2001, David L. Parsley wrote:
>> > attach packages inside it. Since symlinks in a tmpfs filesystem
>> > cost 4k each (ouch!), I'm considering using mount --bind for
>> > everything.
>>
>> What about fixing tmpfs instead?
>
> The question is: How? If you do it like ramfs, you cannot swap
> these symlinks and this is effectively a mlock(symlink) operation
> allowed for normal users. -> BAD!

How about storing it into the inode structure if it fits into the
fs-private union? If it is too big we allocate the page as we do it
now. The union has 192 bytes. This should be sufficient for most
cases.
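
Roughly, the symlink creation path could branch on the target length.
A minimal sketch of the idea follows; shmem_get_inode() and the two
*_operations names are placeholders, and whether the whole union (or
only the part not occupied by shmem_inode_info) is usable is exactly
the open question here:

	static int shmem_symlink(struct inode *dir, struct dentry *dentry,
				 const char *symname)
	{
		struct inode *inode;
		int len = strlen(symname) + 1;

		inode = shmem_get_inode(dir->i_sb, S_IFLNK | S_IRWXUGO, 0);
		if (!inode)
			return -ENOSPC;

		if (len <= sizeof(inode->u)) {
			/* short target: keep it in the fs-private union;
			 * no page needed, it goes away with the inode */
			memcpy((char *) &inode->u, symname, len);
			inode->i_op = &shmem_symlink_inline_operations;
		} else {
			/* long target: copy it into a page, as tmpfs
			 * does today (omitted here) */
			inode->i_op = &shmem_symlink_operations;
		}
		inode->i_size = len - 1;
		d_instantiate(dentry, inode);
		dget(dentry);
		return 0;
	}

	/* matching readlink for the inline case */
	static int shmem_readlink_inline(struct dentry *dentry,
					 char *buffer, int buflen)
	{
		const char *link = (const char *) &dentry->d_inode->u;
		int len = dentry->d_inode->i_size;

		if (len > buflen)
			len = buflen;
		if (copy_to_user(buffer, link, len))
			return -EFAULT;
		return len;
	}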

Greetings
Christoph


2001-04-23 15:24:03

by Ingo Oeser

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Hi Chris,

On Mon, Apr 23, 2001 at 04:54:02PM +0200, Christoph Rohland wrote:
> > The question is: How? If you do it like ramfs, you cannot swap
> > these symlinks and this is effectively a mlock(symlink) operation
> > allowed for normal users. -> BAD!
>
> How about storing it into the inode structure if it fits into the
> fs-private union? If it is too big we allocate the page as we do it
> now. The union has 192 bytes. This should be sufficient for most
> cases.

Great idea. We allocate this space anyway. And we don't have to
care about the internals of this union, because we never have to use
it outside the kernel ;-)

I like it. ext2fs does the same, so there should be no VFS
hassles involved. Al?

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-23 15:37:43

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Mon, 23 Apr 2001, Ingo Oeser wrote:

> Hi Chris,
>
> On Mon, Apr 23, 2001 at 04:54:02PM +0200, Christoph Rohland wrote:
> > > The question is: How? If you do it like ramfs, you cannot swap
> > > these symlinks and this is effectively a mlock(symlink) operation
> > > allowed for normal users. -> BAD!
> >
> > How about storing it into the inode structure if it fits into the
> > fs-private union? If it is too big we allocate the page as we do it
> > now. The union has 192 bytes. This should be sufficient for most
> > cases.
>
> Great idea. We allocate this space anyway. And we don't have to
> care about the internals of this union, because never have to use
> it outside the kernel ;-)
>
> I like it. ext2fs does the same, so there should be no VFS
> hassles involved. Al?

We should get ext2 and friends to move the sucker _out_ of struct inode.
As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
it really has to be done. More filesystems adding stuff into the union
is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
->clear_inode() is the right place for freeing it.
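
In concrete terms the pattern comes down to something like this (a
condensed, generic sketch with a made-up "foofs"; the NFS and HFS patches
later in the thread do it for real):

	static kmem_cache_t *foofs_inode_cachep;

	static void foofs_read_inode(struct inode *inode)
	{
		struct foofs_inode_info *fi;

		fi = kmem_cache_alloc(foofs_inode_cachep, SLAB_KERNEL);
		if (!fi) {
			make_bad_inode(inode);
			return;
		}
		inode->u.generic_ip = fi;
		/* ... fill in fi and the generic fields from disk ... */
	}

	static void foofs_clear_inode(struct inode *inode)
	{
		struct foofs_inode_info *fi = inode->u.generic_ip;

		inode->u.generic_ip = NULL;
		if (fi)
			kmem_cache_free(foofs_inode_cachep, fi);
	}

	/* cache created once, e.g. in the module init: */
	foofs_inode_cachep = kmem_cache_create("foofs_inodes",
					sizeof(struct foofs_inode_info),
					0, SLAB_HWCACHE_ALIGN, NULL, NULL);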

2001-04-23 20:45:31

by Ingo Oeser

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

On Mon, Apr 23, 2001 at 11:36:24AM -0400, Alexander Viro wrote:
> > Great idea. We allocate this space anyway. And we don't have to
> > care about the internals of this union, because never have to use
> > it outside the kernel ;-)
> >
> > I like it. ext2fs does the same, so there should be no VFS
> > hassles involved. Al?
>
> We should get ext2 and friends to move the sucker _out_ of struct inode.
> As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> it really has to be done. More filesystems adding stuff into the union
> is a Bad Thing(tm). If you want to allocates space - allocate if yourself;
> ->clear_inode() is the right place for freeing it.

You need an inode anyway. So why not use the space in it? tmpfs
would only use sizeof(inode->u) - sizeof(struct shmem_inode_info) for
this kind of symlink.

Last time we suggested this, people ended up with some OS trying
it and getting worse performance.

Why? You need to allocate the VFS-inode (vnode in other OSs) and
the on-disk-inode anyway at the same time. You get better
performance and less fragmentation, if you allocate them both
together[1].

So that struct inode around is ok.

BTW: Is it still less than one page? Then it doesn't make me
nervous. Why? Guess what granularity we allocate at, if we
just store pointers instead of inode.u. Or do you like
every FS creating its own slab cache?

Regards

Ingo Oeser

[1] Which is true for other allocations, too.
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-23 20:57:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

In article <[email protected]> you wrote:
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance.

Which OS? Neither BSD nor SVR4/SVR5 (or even SVR3) does that.

> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].

Because having a union in generic code that includes filesystem-specific
members is ugly? It's one of those "a little more performance for a lot of
bad style" optimizations.

Christoph


--
Of course it doesn't work. We've performed a software upgrade.
Whip me. Beat me. Make me maintain AIX.

2001-04-23 21:20:11

by Richard Gooch

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Ingo Oeser writes:
> On Mon, Apr 23, 2001 at 11:36:24AM -0400, Alexander Viro wrote:
> > > Great idea. We allocate this space anyway. And we don't have to
> > > care about the internals of this union, because never have to use
> > > it outside the kernel ;-)
> > >
> > > I like it. ext2fs does the same, so there should be no VFS
> > > hassles involved. Al?
> >
> > We should get ext2 and friends to move the sucker _out_ of struct inode.
> > As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> > it really has to be done. More filesystems adding stuff into the union
> > is a Bad Thing(tm). If you want to allocates space - allocate if yourself;
> > ->clear_inode() is the right place for freeing it.
>
> You need an inode anyway. So why not using the space in it? tmpfs
> would only use sizeof(*inode.u)-sizeof(struct shmem_inode_info) for
> this kind of symlinks.
>
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance.
>
> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].

We want to take out that union because it sucks for virtual
filesystems. Besides, it's ugly.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-23 22:00:31

by Ingo Oeser

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

On Mon, Apr 23, 2001 at 10:56:16PM +0200, Christoph Hellwig wrote:
> In article <[email protected]> you wrote:
> > Last time we suggested this, people ended up with some OS trying
> > it and getting worse performance.
>
> Which OS? Neither BSD nor SVR4/SVR5 (or even SVR3) do that.

Don't remember. I think Larry McVoy told the story, so I cc'ed
him ;-)

> Because having an union in generic code that includes filesystem-specific
> memebers is ugly? It's one of those a little more performance for a lot of
> bad style optimizations.

We have this kind of stuff all over the place. If we allocate
some small amount of memory and need some small amount of memory
associated with it, there is no problem with a little waste.

Waste is better than fragmentation. This is the lesson people
learned from segments on ia32.

Objects are easier to manage, if they are the same size.

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-23 22:11:12

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Tue, 24 Apr 2001, Ingo Oeser wrote:

> We have this kind of stuff all over the place. If we allocate
> some small amount of memory and and need some small amount
> associated with this memory, there is no problem with a little
> waste.

Little? How about a quarter of a kilobyte per inode? sizeof(struct inode)
is nearly half a kilobyte. And the icache can easily get to ~100000 elements.

> Waste is better than fragmentation. This is the lesson people
> learned from segments in the ia32.
>
> Objects are easier to manage, if they are the same size.

So don't keep them in the same cache. Notice that quite a few systems
keep vnode separately from fs-specific data. For a very good reason.

Al

2001-04-23 22:44:12

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Richard Gooch writes:

> We want to take out that union because it sucks for virtual
> filesystems. Besides, it's ugly.

I hope you won't mind if people trash this with benchmarks.

2001-04-23 22:52:20

by Andreas Dilger

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Ingo Oeser writes:
> > We should get ext2 and friends to move the sucker _out_ of struct inode.
> > As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> > it really has to be done. More filesystems adding stuff into the union
> > is a Bad Thing(tm). If you want to allocates space - allocate if yourself;
> > ->clear_inode() is the right place for freeing it.
>
> BTW: Is it still less than one page? Then it doesn't make me
> nervous. Why? Guess what granularity we allocate at, if we
> just store pointers instead of the inode.u. Or do you like
> every FS creating his own slab cache?

I would much rather we allocate a slab cache for each fs type (it
would be trivial at register_fs time). Most people have only a limited
number of filesystems active at a single time, yet tens or hundreds of
thousands of inodes in the inode slab cache. Making the per-fs private
inode data as small as possible would reduce memory wastage considerably,
and not impact performance (AFAICS) if we use a per-fs type slab cache
for fs private data.
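
For what it's worth, a rough sketch of what that could look like; the
fs_inode_size and fs_inode_cachep fields on struct file_system_type are
invented for illustration and don't exist in the current tree:

	/* hypothetical: in register_filesystem(), after the usual checks */
	if (fs->fs_inode_size) {
		fs->fs_inode_cachep = kmem_cache_create(fs->name,
						fs->fs_inode_size, 0,
						SLAB_HWCACHE_ALIGN,
						NULL, NULL);
		if (!fs->fs_inode_cachep)
			return -ENOMEM;
	}
	/* each filesystem's ->read_inode() would then allocate its private
	 * part from fs->fs_inode_cachep instead of rolling its own cache */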

Consider, when I was doing some fs benchmark, my inode slab cache was
over 120k items on a 128MB machine. At 480 bytes per inode, this is
almost 58 MB, close to half of RAM. Reducing this to exactly ext2-sized
inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
systems)!!! (This assumes nfs_inode_info is the largest.)

This also makes it possible to safely (and efficiently) use external
filesystem modules without the need to recompile the kernel. Granted,
if the external filesystem doesn't use more than the largest .u struct,
then it is currently possible as well, but that number changes, so it
is not safe.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-23 22:52:17

by Richard Gooch

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Albert D. Cahalan writes:
> Richard Gooch writes:
>
> > We want to take out that union because it sucks for virtual
> > filesystems. Besides, it's ugly.
>
> I hope you won't mind if people trash this with benchmarks.

But they can't. At least, not for a well designed patch. If there is a
real issue of fragmentation, then there are ways to fix that without
using a bloated union structure. Don't punish some filesystems just
because others have a problem.

Solutions to avoid fragmentation:

- keep a separate VFSinode and FSinode slab cache
- allocate an enlarged VFSinode that contains the FSinode at the end,
  with the generic pointer in the VFSinode part pointing to the FSinode
  part.

It's simply wrong to bloat everyone because some random FS found it
easier to throw in a union.

Besides, for every benchmark that shows how fragmentation hurts, I can
generate a benchmark showing how inode bloat hurts. Lies, damn lies
and benchmarks.
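
Sketched as data layouts (the struct and field names here are invented
for illustration), the two options above are roughly:

	/* option 1: separate slab caches; the VFS inode only carries a
	 * pointer to the per-fs object */
	struct inode {
		/* ... generic fields ... */
		void *fs_inode;		/* allocated from the per-fs cache */
	};

	/* option 2: one enlarged object per filesystem, fs part at the end */
	struct foofs_vfs_inode {
		struct inode vfs_inode;		/* generic part */
		struct foofs_inode_info fs_i;	/* fs-specific part */
	};
	/* with the generic pointer set to &container->fs_i */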

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-23 23:16:36

by Richard Gooch

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Alexander Viro writes:
>
>
> On Mon, 23 Apr 2001, Richard Gooch wrote:
>
> > - keep a separate VFSinode and FSinode slab cache
>
> Yup.
>
> > - allocate an enlarged VFSinode that contains the FSinode at the end,
> > with the generic pointer in the VFSinode part pointing to FSinode
> > part.
>
> Please, don't. It would help with bloat only if you allocated these
> beasts separately for each fs and then you end up with _many_ allocators
> that can generate pointer to struct inode.
>
> "One type - one allocator" is a good rule - violating it turns into
> major PITA couple of years down the road 9 times out of 10.

Agreed. The better option is the separate VFSinode and FSinode caches.
The enlarged inode scheme is also ugly, like the unions. It's just
less bloated :-)

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-23 23:10:05

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Mon, 23 Apr 2001, Richard Gooch wrote:

> - keep a separate VFSinode and FSinode slab cache

Yup.

> - allocate an enlarged VFSinode that contains the FSinode at the end,
> with the generic pointer in the VFSinode part pointing to FSinode
> part.

Please, don't. It would help with bloat only if you allocated these
beasts separately for each fs, and then you end up with _many_ allocators
that can generate a pointer to struct inode.

"One type - one allocator" is a good rule - violating it turns into major
PITA couple of years down the road 9 times out of 10.

2001-04-23 23:26:02

by Rik van Riel

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

On Mon, 23 Apr 2001, Alexander Viro wrote:
> On Mon, 23 Apr 2001, Richard Gooch wrote:
>
> > - keep a separate VFSinode and FSinode slab cache
>
> Yup.

Would it make sense to unify these with the struct
address_space?

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-04-24 00:16:25

by Ed Tomlinson

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Andreas Dilger wrote:

> Consider, when I was doing some fs benchmark, my inode slab cache was
> over 120k items on a 128MB machine. At 480 butes per inode, this is
> almost 58 MB, close to half of RAM. Reducing this to exactly ext2
> sized inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
> systems)!!! (This assumes nfs_inode_info is the largest).

Was this with a recent kernel (post Alexander Viro's dcache pressure fix)?
If not, I suggest rerunning the benchmark. I had/have a patch to apply pressure
to the dcache and icache from kswapd, but it's not been needed here since the
above fix.

Ed Tomlinson <[email protected]>

2001-04-24 01:37:54

by Jan Harkes

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance.
>
> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].
>
> So that struct inode around is ok.
>
> BTW: Is it still less than one page? Then it doesn't make me
> nervous. Why? Guess what granularity we allocate at, if we
> just store pointers instead of the inode.u. Or do you like
> every FS creating his own slab cache?

I've actually got the coda_inode_info (inode->u.u_coda_fs_i) split out
of the union in my development kernel. It doesn't shrink the size of
struct inode yet, since Coda isn't the biggest user at the moment.

But it forced me to do several cleanups in the code, and it even
resulted in fixing a 'leak'. Not a real memory-loss leak, but we were
leaving uninitialized inodes around in the icache for no good reason. Also,
changing a bit in a Coda-specific header file triggers an almost
complete rebuild of the whole kernel (coda.h -> coda_fs_i.h -> fs.h ->
everything?).

The allocation overhead really isn't that bad. kmalloc/kfree are also
using the slabcache, and a trivial variant using a 'private' slabcache
gave me the counters to find the 'leak' I mentioned before.

I can't really evaluate performance impacts. The struct inode is still
the same size, so for now there even is a little bit of additional
memory pressure. Also, Coda wasn't really developed to achieve high
performance but more to explore novel features.

Jan

2001-04-24 02:53:41

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Mon, 23 Apr 2001, Jan Harkes wrote:

> On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:

> > BTW: Is it still less than one page? Then it doesn't make me
> > nervous. Why? Guess what granularity we allocate at, if we
> > just store pointers instead of the inode.u. Or do you like
> > every FS creating his own slab cache?

Oh, for crying out loud. All it takes is half an hour per filesystem.
Here - completely untested patch that does it for NFS. Took about that
long. Absolutely straightforward, very easy to verify correctness.

Some stuff may need tweaking, but not much (e.g. some functions
should take nfs_inode_info instead of inodes, etc.). From the look
of flushd cache it seems that we would be better off with cyclic
lists instead of single-linked ones for the hash, but I didn't look
deep enough.

So consider the patch below as proof-of-concept. Enjoy:

diff -urN S4-pre6/fs/nfs/flushd.c S4-pre6-nfs/fs/nfs/flushd.c
--- S4-pre6/fs/nfs/flushd.c Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/flushd.c Mon Apr 23 22:23:11 2001
@@ -162,11 +162,11 @@

if (NFS_FLAGS(inode) & NFS_INO_FLUSH)
goto out;
- inode->u.nfs_i.hash_next = NULL;
+ NFS_I(inode)->hash_next = NULL;

q = &cache->inodes;
while (*q)
- q = &(*q)->u.nfs_i.hash_next;
+ q = &NFS_I(*q)->hash_next;
*q = inode;

/* Note: we increase the inode i_count in order to prevent
@@ -188,9 +188,9 @@

q = &cache->inodes;
while (*q && *q != inode)
- q = &(*q)->u.nfs_i.hash_next;
+ q = &NFS_I(*q)->hash_next;
if (*q) {
- *q = inode->u.nfs_i.hash_next;
+ *q = NFS_I(inode)->hash_next;
NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
iput(inode);
}
@@ -238,8 +238,8 @@
cache->inodes = NULL;

while ((inode = next) != NULL) {
- next = next->u.nfs_i.hash_next;
- inode->u.nfs_i.hash_next = NULL;
+ next = NFS_I(next)->hash_next;
+ NFS_I(inode)->hash_next = NULL;
NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;

if (flush) {
diff -urN S4-pre6/fs/nfs/inode.c S4-pre6-nfs/fs/nfs/inode.c
--- S4-pre6/fs/nfs/inode.c Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/inode.c Mon Apr 23 22:43:45 2001
@@ -40,11 +40,14 @@
#define NFSDBG_FACILITY NFSDBG_VFS
#define NFS_PARANOIA 1

+static kmem_cache_t *nfs_inode_cachep;
+
static struct inode * __nfs_fhget(struct super_block *, struct nfs_fh *, struct nfs_fattr *);
void nfs_zap_caches(struct inode *);
static void nfs_invalidate_inode(struct inode *);

static void nfs_read_inode(struct inode *);
+static void nfs_clear_inode(struct inode *);
static void nfs_delete_inode(struct inode *);
static void nfs_put_super(struct super_block *);
static void nfs_umount_begin(struct super_block *);
@@ -52,6 +55,7 @@

static struct super_operations nfs_sops = {
read_inode: nfs_read_inode,
+ clear_inode: nfs_clear_inode,
put_inode: force_delete,
delete_inode: nfs_delete_inode,
put_super: nfs_put_super,
@@ -96,23 +100,44 @@
static void
nfs_read_inode(struct inode * inode)
{
+ struct nfs_inode_info *nfsi;
+
+ nfsi = kmem_cache_alloc(nfs_inode_cachep, GFP_KERNEL);
+ if (!nfsi)
+ goto Enomem;
+
inode->i_blksize = inode->i_sb->s_blocksize;
inode->i_mode = 0;
inode->i_rdev = 0;
+ inode->u.generic_ip = nfsi;
NFS_FILEID(inode) = 0;
NFS_FSID(inode) = 0;
NFS_FLAGS(inode) = 0;
- INIT_LIST_HEAD(&inode->u.nfs_i.read);
- INIT_LIST_HEAD(&inode->u.nfs_i.dirty);
- INIT_LIST_HEAD(&inode->u.nfs_i.commit);
- INIT_LIST_HEAD(&inode->u.nfs_i.writeback);
- inode->u.nfs_i.nread = 0;
- inode->u.nfs_i.ndirty = 0;
- inode->u.nfs_i.ncommit = 0;
- inode->u.nfs_i.npages = 0;
+ INIT_LIST_HEAD(&nfsi->read);
+ INIT_LIST_HEAD(&nfsi->dirty);
+ INIT_LIST_HEAD(&nfsi->commit);
+ INIT_LIST_HEAD(&nfsi->writeback);
+ nfsi->nread = 0;
+ nfsi->ndirty = 0;
+ nfsi->ncommit = 0;
+ nfsi->npages = 0;
NFS_CACHEINV(inode);
NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
+ return;
+
+Enomem:
+ make_bad_inode(inode);
+ return;
+}
+
+static void
+nfs_clear_inode(struct inode * inode)
+{
+ struct nfs_inode_info *p = NFS_I(inode);
+ inode->u.generic_ip = NULL;
+ if (p)
+ kmem_cache_free(nfs_inode_cachep, p);
}

static void
@@ -594,7 +619,7 @@
NFS_CACHE_ISIZE(inode) = fattr->size;
NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
- memcpy(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh));
+ memcpy(NFS_FH(inode), fh, sizeof(struct nfs_fh));
}
nfs_refresh_inode(inode, fattr);
}
@@ -621,7 +646,7 @@
return 0;
if (NFS_FILEID(inode) != fattr->fileid)
return 0;
- if (memcmp(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh)) != 0)
+ if (memcmp(NFS_FH(inode), fh, sizeof(struct nfs_fh)) != 0)
return 0;
return 1;
}
@@ -640,7 +665,7 @@
return 1;

/* Has the filehandle changed? If so is the old one stale? */
- if (memcmp(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh)) != 0 &&
+ if (memcmp(NFS_FH(inode), fh, sizeof(struct nfs_fh)) != 0 &&
__nfs_revalidate_inode(NFS_SERVER(inode),inode) == -ESTALE)
return 1;

@@ -1056,6 +1081,24 @@
extern int nfs_init_readpagecache(void);
extern int nfs_destroy_readpagecache(void);

+int nfs_init_inodecache(void)
+{
+ nfs_inode_cachep = kmem_cache_create("nfs_inode_cache",
+ sizeof(struct nfs_inode_info),
+ 0, SLAB_HWCACHE_ALIGN,
+ NULL, NULL);
+ if (nfs_inode_cachep == NULL)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void nfs_destroy_inodecache(void)
+{
+ if (kmem_cache_destroy(nfs_inode_cachep))
+ printk(KERN_INFO "nfs_inode_cache: not all structures were freed\n");
+}
+
/*
* Initialize NFS
*/
@@ -1067,6 +1110,10 @@
if (err)
return err;

+ err = nfs_init_inodecache();
+ if (err)
+ return err;
+
err = nfs_init_readpagecache();
if (err)
return err;
@@ -1080,6 +1127,7 @@
static void __exit exit_nfs_fs(void)
{
nfs_destroy_readpagecache();
+ nfs_destroy_inodecache();
nfs_destroy_nfspagecache();
#ifdef CONFIG_PROC_FS
rpc_proc_unregister("nfs");
diff -urN S4-pre6/fs/nfs/read.c S4-pre6-nfs/fs/nfs/read.c
--- S4-pre6/fs/nfs/read.c Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/read.c Mon Apr 23 22:18:25 2001
@@ -153,7 +153,7 @@
{
struct list_head *head, *next;

- head = &inode->u.nfs_i.read;
+ head = &NFS_I(inode)->read;
next = head->next;
while (next != head) {
struct nfs_page *req = nfs_list_entry(next);
@@ -183,11 +183,12 @@
nfs_mark_request_read(struct nfs_page *req)
{
struct inode *inode = req->wb_inode;
+ struct nfs_inode_info *nfsi = NFS_I(inode);

spin_lock(&nfs_wreq_lock);
if (list_empty(&req->wb_list)) {
- nfs_list_add_request(req, &inode->u.nfs_i.read);
- inode->u.nfs_i.nread++;
+ nfs_list_add_request(req, &nfsi->read);
+ nfsi->nread++;
}
spin_unlock(&nfs_wreq_lock);
/*
@@ -234,7 +235,7 @@
break;
}

- if (inode->u.nfs_i.nread >= NFS_SERVER(inode)->rpages ||
+ if (NFS_I(inode)->nread >= NFS_SERVER(inode)->rpages ||
page_index(page) == (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
nfs_pagein_inode(inode, 0, 0);
if (new)
@@ -372,10 +373,11 @@
nfs_scan_read_timeout(struct inode *inode, struct list_head *dst)
{
int pages;
+ struct nfs_inode_info *nfsi = NFS_I(inode);
spin_lock(&nfs_wreq_lock);
- pages = nfs_scan_list_timeout(&inode->u.nfs_i.read, dst, inode);
- inode->u.nfs_i.nread -= pages;
- if ((inode->u.nfs_i.nread == 0) != list_empty(&inode->u.nfs_i.read))
+ pages = nfs_scan_list_timeout(&nfsi->read, dst, inode);
+ nfsi->nread -= pages;
+ if ((nfsi->nread == 0) != list_empty(&nfsi->read))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.nread.\n");
spin_unlock(&nfs_wreq_lock);
return pages;
@@ -385,10 +387,11 @@
nfs_scan_read(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
{
int res;
+ struct nfs_inode_info *nfsi = NFS_I(inode);
spin_lock(&nfs_wreq_lock);
- res = nfs_scan_list(&inode->u.nfs_i.read, dst, NULL, idx_start, npages);
- inode->u.nfs_i.nread -= res;
- if ((inode->u.nfs_i.nread == 0) != list_empty(&inode->u.nfs_i.read))
+ res = nfs_scan_list(&nfsi->read, dst, NULL, idx_start, npages);
+ nfsi->nread -= res;
+ if ((nfsi->nread == 0) != list_empty(&nfsi->read))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.nread.\n");
spin_unlock(&nfs_wreq_lock);
return res;
diff -urN S4-pre6/fs/nfs/write.c S4-pre6-nfs/fs/nfs/write.c
--- S4-pre6/fs/nfs/write.c Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/write.c Mon Apr 23 22:15:06 2001
@@ -329,14 +329,15 @@
static inline void
nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
{
+ struct nfs_inode_info *nfsi = NFS_I(inode);
if (!list_empty(&req->wb_hash))
return;
if (!NFS_WBACK_BUSY(req))
printk(KERN_ERR "NFS: unlocked request attempted hashed!\n");
- if (list_empty(&inode->u.nfs_i.writeback))
+ if (list_empty(&nfsi->writeback))
atomic_inc(&inode->i_count);
- inode->u.nfs_i.npages++;
- list_add(&req->wb_hash, &inode->u.nfs_i.writeback);
+ nfsi->npages++;
+ list_add(&req->wb_hash, &nfsi->writeback);
req->wb_count++;
}

@@ -347,6 +348,7 @@
nfs_inode_remove_request(struct nfs_page *req)
{
struct inode *inode;
+ struct nfs_inode_info *nfsi;
spin_lock(&nfs_wreq_lock);
if (list_empty(&req->wb_hash)) {
spin_unlock(&nfs_wreq_lock);
@@ -355,12 +357,13 @@
if (!NFS_WBACK_BUSY(req))
printk(KERN_ERR "NFS: unlocked request attempted unhashed!\n");
inode = req->wb_inode;
+ nfsi = NFS_I(inode);
list_del(&req->wb_hash);
INIT_LIST_HEAD(&req->wb_hash);
- inode->u.nfs_i.npages--;
- if ((inode->u.nfs_i.npages == 0) != list_empty(&inode->u.nfs_i.writeback))
+ nfsi->npages--;
+ if ((nfsi->npages == 0) != list_empty(&nfsi->writeback))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.npages.\n");
- if (list_empty(&inode->u.nfs_i.writeback))
+ if (list_empty(&nfsi->writeback))
iput(inode);
if (!nfs_have_writebacks(inode) && !nfs_have_read(inode))
inode_remove_flushd(inode);
@@ -376,7 +379,7 @@
{
struct list_head *head, *next;

- head = &inode->u.nfs_i.writeback;
+ head = &NFS_I(inode)->writeback;
next = head->next;
while (next != head) {
struct nfs_page *req = nfs_inode_wb_entry(next);
@@ -448,8 +451,9 @@

spin_lock(&nfs_wreq_lock);
if (list_empty(&req->wb_list)) {
- nfs_list_add_request(req, &inode->u.nfs_i.dirty);
- inode->u.nfs_i.ndirty++;
+ struct nfs_inode_info *nfsi = NFS_I(inode);
+ nfs_list_add_request(req, &nfsi->dirty);
+ nfsi->ndirty++;
}
spin_unlock(&nfs_wreq_lock);
/*
@@ -466,7 +470,7 @@
nfs_dirty_request(struct nfs_page *req)
{
struct inode *inode = req->wb_inode;
- return !list_empty(&req->wb_list) && req->wb_list_head == &inode->u.nfs_i.dirty;
+ return !list_empty(&req->wb_list) && req->wb_list_head == &NFS_I(inode)->dirty;
}

#ifdef CONFIG_NFS_V3
@@ -480,8 +484,9 @@

spin_lock(&nfs_wreq_lock);
if (list_empty(&req->wb_list)) {
- nfs_list_add_request(req, &inode->u.nfs_i.commit);
- inode->u.nfs_i.ncommit++;
+ struct nfs_inode_info *nfsi = NFS_I(inode);
+ nfs_list_add_request(req, &nfsi->commit);
+ nfsi->ncommit++;
}
spin_unlock(&nfs_wreq_lock);
/*
@@ -657,7 +662,7 @@
idx_end = idx_start + npages - 1;

spin_lock(&nfs_wreq_lock);
- head = &inode->u.nfs_i.writeback;
+ head = &NFS_I(inode)->writeback;
p = head->next;
while (p != head) {
unsigned long pg_idx;
@@ -720,10 +725,11 @@
nfs_scan_dirty_timeout(struct inode *inode, struct list_head *dst)
{
int pages;
+ struct nfs_inode_info *nfsi = NFS_I(inode);
spin_lock(&nfs_wreq_lock);
- pages = nfs_scan_list_timeout(&inode->u.nfs_i.dirty, dst, inode);
- inode->u.nfs_i.ndirty -= pages;
- if ((inode->u.nfs_i.ndirty == 0) != list_empty(&inode->u.nfs_i.dirty))
+ pages = nfs_scan_list_timeout(&nfsi->dirty, dst, inode);
+ nfsi->ndirty -= pages;
+ if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
spin_unlock(&nfs_wreq_lock);
return pages;
@@ -734,10 +740,11 @@
nfs_scan_commit_timeout(struct inode *inode, struct list_head *dst)
{
int pages;
+ struct nfs_inode_info *nfsi = NFS_I(inode);
spin_lock(&nfs_wreq_lock);
- pages = nfs_scan_list_timeout(&inode->u.nfs_i.commit, dst, inode);
- inode->u.nfs_i.ncommit -= pages;
- if ((inode->u.nfs_i.ncommit == 0) != list_empty(&inode->u.nfs_i.commit))
+ pages = nfs_scan_list_timeout(&nfsi->commit, dst, inode);
+ nfsi->ncommit -= pages;
+ if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
spin_unlock(&nfs_wreq_lock);
return pages;
@@ -783,10 +790,11 @@
nfs_scan_dirty(struct inode *inode, struct list_head *dst, struct file *file, unsigned long idx_start, unsigned int npages)
{
int res;
+ struct nfs_inode_info *nfsi = NFS_I(inode);
spin_lock(&nfs_wreq_lock);
- res = nfs_scan_list(&inode->u.nfs_i.dirty, dst, file, idx_start, npages);
- inode->u.nfs_i.ndirty -= res;
- if ((inode->u.nfs_i.ndirty == 0) != list_empty(&inode->u.nfs_i.dirty))
+ res = nfs_scan_list(&nfsi->dirty, dst, file, idx_start, npages);
+ nfsi->ndirty -= res;
+ if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
spin_unlock(&nfs_wreq_lock);
return res;
@@ -797,10 +805,11 @@
nfs_scan_commit(struct inode *inode, struct list_head *dst, struct file *file, unsigned long idx_start, unsigned int npages)
{
int res;
+ struct nfs_inode_info *nfsi = NFS_I(inode);
spin_lock(&nfs_wreq_lock);
- res = nfs_scan_list(&inode->u.nfs_i.commit, dst, file, idx_start, npages);
- inode->u.nfs_i.ncommit -= res;
- if ((inode->u.nfs_i.ncommit == 0) != list_empty(&inode->u.nfs_i.commit))
+ res = nfs_scan_list(&nfsi->commit, dst, file, idx_start, npages);
+ nfsi->ncommit -= res;
+ if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
spin_unlock(&nfs_wreq_lock);
return res;
@@ -885,7 +894,7 @@
/*
* If we're over the soft limit, flush out old requests
*/
- if (inode->u.nfs_i.npages >= MAX_REQUEST_SOFT)
+ if (NFS_I(inode)->npages >= MAX_REQUEST_SOFT)
nfs_wb_file(inode, file);
new = nfs_create_request(file, inode, page, offset, bytes);
if (!new)
@@ -952,8 +961,9 @@
nfs_strategy(struct inode *inode)
{
unsigned int dirty, wpages;
+ struct nfs_inode_info *nfsi = NFS_I(inode);

- dirty = inode->u.nfs_i.ndirty;
+ dirty = nfsi->ndirty;
wpages = NFS_SERVER(inode)->wpages;
#ifdef CONFIG_NFS_V3
if (NFS_PROTO(inode)->version == 2) {
@@ -962,7 +972,7 @@
} else {
if (dirty >= wpages)
nfs_flush_file(inode, NULL, 0, 0, 0);
- if (inode->u.nfs_i.ncommit > NFS_STRATEGY_PAGES * wpages &&
+ if (nfsi->ncommit > NFS_STRATEGY_PAGES * wpages &&
atomic_read(&nfs_nr_requests) > MAX_REQUEST_SOFT)
nfs_commit_file(inode, NULL, 0, 0, 0);
}
@@ -974,7 +984,7 @@
* If we're running out of free requests, flush out everything
* in order to reduce memory useage...
*/
- if (inode->u.nfs_i.npages > MAX_REQUEST_SOFT)
+ if (nfsi->npages > MAX_REQUEST_SOFT)
nfs_wb_all(inode);
}

@@ -1141,7 +1151,7 @@
/* Set up the argument struct */
nfs_write_rpcsetup(head, data);
if (stable) {
- if (!inode->u.nfs_i.ncommit)
+ if (!NFS_I(inode)->ncommit)
data->args.stable = NFS_FILE_SYNC;
else
data->args.stable = NFS_DATA_SYNC;
diff -urN S4-pre6/include/linux/fs.h S4-pre6-nfs/include/linux/fs.h
--- S4-pre6/include/linux/fs.h Sat Apr 21 14:35:32 2001
+++ S4-pre6-nfs/include/linux/fs.h Mon Apr 23 22:40:10 2001
@@ -448,7 +448,6 @@
struct msdos_inode_info msdos_i;
struct umsdos_inode_info umsdos_i;
struct iso_inode_info isofs_i;
- struct nfs_inode_info nfs_i;
struct sysv_inode_info sysv_i;
struct affs_inode_info affs_i;
struct ufs_inode_info ufs_i;
diff -urN S4-pre6/include/linux/nfs_fs.h S4-pre6-nfs/include/linux/nfs_fs.h
--- S4-pre6/include/linux/nfs_fs.h Sat Apr 21 14:35:32 2001
+++ S4-pre6-nfs/include/linux/nfs_fs.h Mon Apr 23 22:40:17 2001
@@ -63,39 +63,44 @@
*/
#define NFS_SUPER_MAGIC 0x6969

-#define NFS_FH(inode) (&(inode)->u.nfs_i.fh)
+static inline struct nfs_inode_info *NFS_I(struct inode *inode)
+{
+ return (struct nfs_inode_info *)inode->u.generic_ip;
+}
+
+#define NFS_FH(inode) (&NFS_I(inode)->fh)
#define NFS_SERVER(inode) (&(inode)->i_sb->u.nfs_sb.s_server)
#define NFS_CLIENT(inode) (NFS_SERVER(inode)->client)
#define NFS_PROTO(inode) (NFS_SERVER(inode)->rpc_ops)
#define NFS_REQUESTLIST(inode) (NFS_SERVER(inode)->rw_requests)
#define NFS_ADDR(inode) (RPC_PEERADDR(NFS_CLIENT(inode)))
#define NFS_CONGESTED(inode) (RPC_CONGESTED(NFS_CLIENT(inode)))
-#define NFS_COOKIEVERF(inode) ((inode)->u.nfs_i.cookieverf)
-#define NFS_READTIME(inode) ((inode)->u.nfs_i.read_cache_jiffies)
-#define NFS_CACHE_CTIME(inode) ((inode)->u.nfs_i.read_cache_ctime)
-#define NFS_CACHE_MTIME(inode) ((inode)->u.nfs_i.read_cache_mtime)
-#define NFS_CACHE_ATIME(inode) ((inode)->u.nfs_i.read_cache_atime)
-#define NFS_CACHE_ISIZE(inode) ((inode)->u.nfs_i.read_cache_isize)
-#define NFS_NEXTSCAN(inode) ((inode)->u.nfs_i.nextscan)
+#define NFS_COOKIEVERF(inode) (NFS_I(inode)->cookieverf)
+#define NFS_READTIME(inode) (NFS_I(inode)->read_cache_jiffies)
+#define NFS_CACHE_CTIME(inode) (NFS_I(inode)->read_cache_ctime)
+#define NFS_CACHE_MTIME(inode) (NFS_I(inode)->read_cache_mtime)
+#define NFS_CACHE_ATIME(inode) (NFS_I(inode)->read_cache_atime)
+#define NFS_CACHE_ISIZE(inode) (NFS_I(inode)->read_cache_isize)
+#define NFS_NEXTSCAN(inode) (NFS_I(inode)->nextscan)
#define NFS_CACHEINV(inode) \
do { \
NFS_READTIME(inode) = jiffies - NFS_MAXATTRTIMEO(inode) - 1; \
} while (0)
-#define NFS_ATTRTIMEO(inode) ((inode)->u.nfs_i.attrtimeo)
+#define NFS_ATTRTIMEO(inode) (NFS_I(inode)->attrtimeo)
#define NFS_MINATTRTIMEO(inode) \
(S_ISDIR(inode->i_mode)? NFS_SERVER(inode)->acdirmin \
: NFS_SERVER(inode)->acregmin)
#define NFS_MAXATTRTIMEO(inode) \
(S_ISDIR(inode->i_mode)? NFS_SERVER(inode)->acdirmax \
: NFS_SERVER(inode)->acregmax)
-#define NFS_ATTRTIMEO_UPDATE(inode) ((inode)->u.nfs_i.attrtimeo_timestamp)
+#define NFS_ATTRTIMEO_UPDATE(inode) (NFS_I(inode)->attrtimeo_timestamp)

-#define NFS_FLAGS(inode) ((inode)->u.nfs_i.flags)
+#define NFS_FLAGS(inode) (NFS_I(inode)->flags)
#define NFS_REVALIDATING(inode) (NFS_FLAGS(inode) & NFS_INO_REVALIDATING)
#define NFS_STALE(inode) (NFS_FLAGS(inode) & NFS_INO_STALE)

-#define NFS_FILEID(inode) ((inode)->u.nfs_i.fileid)
-#define NFS_FSID(inode) ((inode)->u.nfs_i.fsid)
+#define NFS_FILEID(inode) (NFS_I(inode)->fileid)
+#define NFS_FSID(inode) (NFS_I(inode)->fsid)

/* Inode Flags */
#define NFS_USE_READDIRPLUS(inode) ((NFS_FLAGS(inode) & NFS_INO_ADVISE_RDPLUS) ? 1 : 0)
@@ -212,13 +217,13 @@
static inline int
nfs_have_read(struct inode *inode)
{
- return !list_empty(&inode->u.nfs_i.read);
+ return !list_empty(&NFS_I(inode)->read);
}

static inline int
nfs_have_writebacks(struct inode *inode)
{
- return !list_empty(&inode->u.nfs_i.writeback);
+ return !list_empty(&NFS_I(inode)->writeback);
}

static inline int




2001-04-24 03:57:59

by Andreas Dilger

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Ed Tomlinson writes:
> > Consider, when I was doing some fs benchmark, my inode slab cache was
> > over 120k items on a 128MB machine. At 480 butes per inode, this is
> > almost 58 MB, close to half of RAM. Reducing this to exactly ext2
> > sized inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
> > systems)!!! (This assumes nfs_inode_info is the largest).
>
> Was this with a recient kernel (post Alexander Viro's dcache pressure fix)?
> If not I suggest rerunning the benchmark. I had/have a patch to apply
> pressure to the dcache and icache from kswapd but its not been needed here
> since the above fix.

Actually, it had the dcache patch but I'm not aware of a patch from Al to
change icache behaviour. In any case, changing the icache behaviour is
not what I'm getting at here - having the union of all private inode
structs in the generic inode is a huge waste of RAM. Even systems
that are heavy NFS users will likely still have a considerable amount
of local filesystem space (excluding NFS root systems, which are very few).

Al posted a patch to the NFS code which removes nfs_inode_info from the
inode union. Since it is (AFAIK) the largest member of the union, we
have just saved 24 bytes per inode (hfs_inode_info is also rather large).
If we removed hfs_inode_info as well, we would save 108 bytes per inode,
about 22% ({ext2,affs,ufs}_inode_info are all about the same size).

No point in punishing all users for filesystems they don't necessarily use.
Even people who DO use NFS and/or HFS are probably still wasting
10k inodes * 108 bytes = 1MB of RAM for no good reason (because most of
their inodes are probably not NFS and/or HFS).

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-24 06:51:35

by Christoph Rohland

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Hi Alexander,

On Mon, 23 Apr 2001, Alexander Viro wrote:
>> I like it. ext2fs does the same, so there should be no VFS
>> hassles involved. Al?
>
> We should get ext2 and friends to move the sucker _out_ of struct
> inode. As it is, sizeof(struct inode) is way too large. This is 2.5
> stuff, but it really has to be done. More filesystems adding stuff
> into the union is a Bad Thing(tm). If you want to allocates space -
> allocate if yourself; ->clear_inode() is the right place for freeing
> it.

Yes, I agree that the union is way too large and I did not plan to
extend it but simply use the size it has.

if (strlen(path) < sizeof(inode->u))
	inline the symlink;
else
	put it into the page cache;

So if somebody really cleans up the private inode structures, it will
not trigger that often any more and we will perhaps have to rethink the
idea.

But also, if we used the size of struct shmem_inode_info, which is 92 bytes
right now, we would inline all symlinks on my machine.

If we reduced its size to 32 (which could be easily done) we would
still inline 6642 out of 9317 symlinks on my machine. That's not bad.

Greetings
Christoph


2001-04-24 09:59:54

by David Woodhouse

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?


[email protected] said:
> Oh, for crying out loud. All it takes is half an hour per filesystem.

Half an hour? If it takes more than about 5 minutes for JFFS2 I'd be very
surprised.

--
dwmw2


2001-04-24 10:01:35

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Mon, 23 Apr 2001, Andreas Dilger wrote:

> Al posted a patch to the NFS code which removes nfs_inode_info from the
> inode union. Since it is (AFAIK) the largest member of the union, we
> have just saved 24 bytes per inode (hfs_inode_info is also rather large).
> If we removed hfs_inode_info as well, we would save 108 bytes per inode,
> about 22% ({ext2,affs,ufs}_inode_info are all about the same size).

For fsck sake! HFS patch. Time: 14 minutes, including checking that sucker
builds (it had most of the accesses to ->u.hfs_i already encapsulated).

What I really don't understand is why the hell people keep coming up
with grand and convoluted plans for removing the inode bloat and no
bleedin' one of them actually cared to sit down and do the simplest variant
possible.

I can certainly go through the rest of the filesystems and even do testing
for most of them, but WTF? Could the rest of you please join the show?
It's not fscking rocket science - encapsulate accesses to ->u.foofs_i
into an inlined function, find ->read_inode, find places that do get_empty_inode()
or new_inode(), add allocation there, add freeing to ->clear_inode()
(defining one if needed), change that inlined function so that it
returns ->u.generic_ip and you are done. Clean the results up and test
them. Furrfu...

It's not like it was a global change that affected the whole kernel -
at every step the changes are local to one filesystem, and changes for
different filesystems are independent of each other. If at some point
in 2.5 .generic_ip is the only member of the union - fine, we just do
%s/u.generic_ip/fs_inode/g
or something like that. Moreover, if the maintainer of filesystem foo is
OK with the change it _can_ be done in 2.4 - it doesn't affect anything
outside of foofs.

Guys, doing all these patches is ~20 man-hours. And that's a bloody generous
estimate. Looking through the results and doing the necessary tweaking
(as in "hmm... we keep passing a pointer to the inode through a long chain
of functions and all of them need only the fs-specific part", etc.) - about
the same. Verifying that the thing wasn't fucked up - maybe an hour or two of
audit per filesystem (split the patch into the encapsulation part - trivial
to verify - and the rest - pretty small). Grrr...

Oh, well... Initial HFS patch follows:

diff -urN S4-pre6/fs/hfs/inode.c S4-pre6-hfs/fs/hfs/inode.c
--- S4-pre6/fs/hfs/inode.c Fri Feb 16 22:55:36 2001
+++ S4-pre6-hfs/fs/hfs/inode.c Tue Apr 24 05:10:21 2001
@@ -231,7 +231,7 @@
static int hfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
{
return cont_prepare_write(page,from,to,hfs_get_block,
- &page->mapping->host->u.hfs_i.mmu_private);
+ &HFS_I(page->mapping->host)->mmu_private);
}
static int hfs_bmap(struct address_space *mapping, long block)
{
@@ -309,7 +309,7 @@
return NULL;
}

- if (inode->i_dev != sb->s_dev) {
+ if (inode->i_dev != sb->s_dev || !HFS_I(inode)) {
iput(inode); /* automatically does an hfs_cat_put */
inode = NULL;
} else if (!inode->i_mode || (*sys_entry == NULL)) {
@@ -373,7 +373,7 @@
inode->i_op = &hfs_file_inode_operations;
inode->i_fop = &hfs_file_operations;
inode->i_mapping->a_ops = &hfs_aops;
- inode->u.hfs_i.mmu_private = inode->i_size;
+ HFS_I(inode)->mmu_private = inode->i_size;
} else { /* Directory */
struct hfs_dir *hdir = &entry->u.dir;

@@ -433,7 +433,7 @@
inode->i_op = &hfs_file_inode_operations;
inode->i_fop = &hfs_file_operations;
inode->i_mapping->a_ops = &hfs_aops;
- inode->u.hfs_i.mmu_private = inode->i_size;
+ HFS_I(inode)->mmu_private = inode->i_size;
} else { /* Directory */
struct hfs_dir *hdir = &entry->u.dir;

@@ -479,7 +479,7 @@
inode->i_op = &hfs_file_inode_operations;
inode->i_fop = &hfs_file_operations;
inode->i_mapping->a_ops = &hfs_aops;
- inode->u.hfs_i.mmu_private = inode->i_size;
+ HFS_I(inode)->mmu_private = inode->i_size;
} else { /* Directory */
struct hfs_dir *hdir = &entry->u.dir;

diff -urN S4-pre6/fs/hfs/super.c S4-pre6-hfs/fs/hfs/super.c
--- S4-pre6/fs/hfs/super.c Sat Apr 21 14:35:20 2001
+++ S4-pre6-hfs/fs/hfs/super.c Tue Apr 24 05:26:04 2001
@@ -35,6 +35,7 @@
/*================ Forward declarations ================*/

static void hfs_read_inode(struct inode *);
+static void hfs_clear_inode(struct inode *);
static void hfs_put_super(struct super_block *);
static int hfs_statfs(struct super_block *, struct statfs *);
static void hfs_write_super(struct super_block *);
@@ -43,6 +44,7 @@

static struct super_operations hfs_super_operations = {
read_inode: hfs_read_inode,
+ clear_inode: hfs_clear_inode,
put_inode: hfs_put_inode,
put_super: hfs_put_super,
write_super: hfs_write_super,
@@ -52,6 +54,7 @@
/*================ File-local variables ================*/

static DECLARE_FSTYPE_DEV(hfs_fs, "hfs", hfs_read_super);
+static kmem_cache_t *hfs_cachep;

/*================ File-local functions ================*/

@@ -64,6 +67,15 @@
static void hfs_read_inode(struct inode *inode)
{
inode->i_mode = 0;
+ inode->u.generic_ip = kmem_cache_alloc(hfs_cachep, SLAB_KERNEL);
+}
+
+static void hfs_clear_inode(struct inode *inode)
+{
+ struct hfs_inode_info *hfsi = HFS_I(inode);
+ inode->u.generic_ip = NULL;
+ if (hfsi)
+ kmem_cache_free(hfs_cachep, hfsi);
}

/*
@@ -475,12 +487,20 @@
static int __init init_hfs_fs(void)
{
hfs_cat_init();
+ hfs_cachep = kmem_cache_create("hfs_inodes",
+ sizeof(struct hfs_inode_info),
+ 0, SLAB_HWCACHE_ALIGN,
+ NULL, NULL);
+ if (hfs_cachep == NULL)
+ return -ENOMEM;
return register_filesystem(&hfs_fs);
}

static void __exit exit_hfs_fs(void) {
hfs_cat_free();
unregister_filesystem(&hfs_fs);
+ if (kmem_cache_destroy(hfs_cachep))
+ printk(KERN_INFO "hfs_inodes: not all structures were freed\n");
}

module_init(init_hfs_fs)
diff -urN S4-pre6/include/linux/fs.h S4-pre6-hfs/include/linux/fs.h
--- S4-pre6/include/linux/fs.h Sat Apr 21 14:35:32 2001
+++ S4-pre6-hfs/include/linux/fs.h Tue Apr 24 05:21:34 2001
@@ -293,7 +293,6 @@
#include <linux/romfs_fs_i.h>
#include <linux/shmem_fs.h>
#include <linux/smb_fs_i.h>
-#include <linux/hfs_fs_i.h>
#include <linux/adfs_fs_i.h>
#include <linux/qnx4_fs_i.h>
#include <linux/reiserfs_fs_i.h>
@@ -457,7 +456,6 @@
struct shmem_inode_info shmem_i;
struct coda_inode_info coda_i;
struct smb_inode_info smbfs_i;
- struct hfs_inode_info hfs_i;
struct adfs_inode_info adfs_i;
struct qnx4_inode_info qnx4_i;
struct reiserfs_inode_info reiserfs_i;
diff -urN S4-pre6/include/linux/hfs_fs.h S4-pre6-hfs/include/linux/hfs_fs.h
--- S4-pre6/include/linux/hfs_fs.h Fri Feb 16 22:55:40 2001
+++ S4-pre6-hfs/include/linux/hfs_fs.h Tue Apr 24 05:24:18 2001
@@ -317,7 +317,11 @@
extern int hfs_mac2triv(char *, const struct hfs_name *);
extern void hfs_tolower(unsigned char *, int);

-#define HFS_I(X) (&((X)->u.hfs_i))
+static inline struct hfs_inode_info *HFS_I(struct inode *inode)
+{
+ return (struct hfs_inode_info *)inode->u.generic_ip;
+}
+
#define HFS_SB(X) (&((X)->u.hfs_sb))

static inline void hfs_nameout(struct inode *dir, struct hfs_name *out,

2001-04-24 10:08:05

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Tue, 24 Apr 2001, David Woodhouse wrote:

>
> [email protected] said:
> > Oh, for crying out loud. All it takes is half an hour per filesystem.
>
> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd be very
> surprised.

<tone polite> What's stopping you? </tone>
You _are_ JFFS maintainer, aren't you?

2001-04-24 10:37:59

by Christoph Rohland

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Hi Al,

On Tue, 24 Apr 2001, Alexander Viro wrote:
>> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd
>> be very surprised.
>
> <tone polite> What's stopping you? </tone>
> You _are_ JFFS maintainer, aren't you?

So is this the start of changing all filesystems in 2.4? I am not sure
we should do that.

Greetings
Christoph


2001-04-24 10:54:39

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On 24 Apr 2001, Christoph Rohland wrote:

> Hi Al,
>
> On Tue, 24 Apr 2001, Alexander Viro wrote:
> >> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd
> >> be very surprised.
> >
> > <tone polite> What's stopping you? </tone>
> > You _are_ JFFS maintainer, aren't you?
>
> So is this the start to change all filesystems in 2.4? I am not sure
> we should do that.

The encapsulation part is definitely worth doing - it cleans the code up
and doesn't change the result of the compile. Adding allocation/freeing/
cache initialization/cache removal and changing the FOOFS_I() definition -
well, it's probably worth keeping such patches around, but whether
to switch any individual filesystem during 2.4 is a policy decision.
Up to the maintainer, indeed. Notice that these patches (separate allocation
per se) are going to be within 3-4KB per filesystem _and_ completely
straightforward.

What I would like to avoid is a scenario like

Maintainers of filesystems with large private inodes: Why would we separate
them? We would only waste memory, since the other filesystems stay in ->u
and keep it large.

Maintainers of the rest of the filesystems: Since there are no patches that
would take the large stuff out of ->u, why would we bother?

So yes, IMO having such patches available _is_ a good thing. And in 2.5
we definitely want them in the tree. If the encapsulation part gets there
during 2.4 and separate allocation is available for all of them, it will
be easier to do without PITA in the process.
Al


2001-04-24 12:54:37

by David Woodhouse

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?


[email protected] said:
> What I would like to avoid is scenario like
> Maintainers of filesystems with large private inodes: Why would we
> separate them? We would only waste memory, since the other filesystems
> stay in ->u and keep it large.

> Maintainers of the rest of filesystems: Since there's no patches that
> would take large stuff out of ->u, why would we bother?

> So yes, IMO having such patches available _is_ a good thing. And in
> 2.5 we definitely want them in the tree. If encapsulation part gets
> there during 2.4 and separate allocation is available for all of them
> it will be easier to do without PITA in process.

JFFS2 has the encapsulation part already. I'll make it do separate
allocation in 2.5, when it's actually a gain.

--
dwmw2


2001-04-24 13:28:38

by Erik Mouw

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

On Tue, Apr 24, 2001 at 06:01:12AM -0400, Alexander Viro wrote:
> For fsck sake! HFS patch. Time: 14 minutes, including checking that sucker
> builds (it had most of the accesses to ->u.hfs_i already encapsulated).

Al is right, it is not rocket science. Here is a patch against
2.4.4-pre6 for procfs and isofs. It took me an hour to do because I'm
not familiar with the fs code. It compiles, and the procfs code even
runs (sorry, no CDROM player available on my embedded StrongARM
system), though it is possible that there are some bugs in it.


Erik

Index: fs/isofs/inode.c
===================================================================
RCS file: /home/erik/cvsroot/elinux/fs/isofs/inode.c,v
retrieving revision 1.1.1.24
diff -d -u -r1.1.1.24 inode.c
--- fs/isofs/inode.c 2001/04/21 21:24:00 1.1.1.24
+++ fs/isofs/inode.c 2001/04/24 13:13:29
@@ -15,6 +15,7 @@
#include <linux/stat.h>
#include <linux/sched.h>
#include <linux/iso_fs.h>
+#include <linux/iso_fs_i.h>
#include <linux/kernel.h>
#include <linux/major.h>
#include <linux/mm.h>
@@ -44,6 +45,8 @@
static int check_bread = 0;
#endif

+static kmem_cache_t *isofs_cachep;
+
static int isofs_hashi(struct dentry *parent, struct qstr *qstr);
static int isofs_hash(struct dentry *parent, struct qstr *qstr);
static int isofs_dentry_cmpi(struct dentry *dentry, struct qstr *a, struct qstr *b);
@@ -74,10 +77,12 @@
}

static void isofs_read_inode(struct inode *);
+static void isofs_clear_inode(struct inode *);
static int isofs_statfs (struct super_block *, struct statfs *);

static struct super_operations isofs_sops = {
read_inode: isofs_read_inode,
+ clear_inode: isofs_clear_inode,
put_super: isofs_put_super,
statfs: isofs_statfs,
};
@@ -908,9 +913,9 @@
goto abort_beyond_end;

offset = 0;
- firstext = inode->u.isofs_i.i_first_extent;
- sect_size = inode->u.isofs_i.i_section_size >> ISOFS_BUFFER_BITS(inode);
- nextino = inode->u.isofs_i.i_next_section_ino;
+ firstext = ISOFS_I(inode)->i_first_extent;
+ sect_size = ISOFS_I(inode)->i_section_size >> ISOFS_BUFFER_BITS(inode);
+ nextino = ISOFS_I(inode)->i_next_section_ino;

i = 0;
if (nextino) {
@@ -923,9 +928,9 @@
ninode = iget(inode->i_sb, nextino);
if (!ninode)
goto abort;
- firstext = ninode->u.isofs_i.i_first_extent;
- sect_size = ninode->u.isofs_i.i_section_size;
- nextino = ninode->u.isofs_i.i_next_section_ino;
+ firstext = ISOFS_I(ninode)->i_first_extent;
+ sect_size = ISOFS_I(ninode)->i_section_size;
+ nextino = ISOFS_I(ninode)->i_next_section_ino;
iput(ninode);

if (++i > 100)
@@ -1025,7 +1030,7 @@
struct iso_directory_record * tmpde = NULL;

inode->i_size = 0;
- inode->u.isofs_i.i_next_section_ino = 0;
+ ISOFS_I(inode)->i_next_section_ino = 0;

block = f_pos >> ISOFS_BUFFER_BITS(inode);
offset = f_pos & (bufsize-1);
@@ -1077,7 +1082,7 @@

inode->i_size += isonum_733(de->size);
if (i == 1)
- inode->u.isofs_i.i_next_section_ino = f_pos;
+ ISOFS_I(inode)->i_next_section_ino = f_pos;

more_entries = de->flags[-high_sierra] & 0x80;

@@ -1174,9 +1179,10 @@
inode->i_uid = inode->i_sb->u.isofs_sb.s_uid;
inode->i_gid = inode->i_sb->u.isofs_sb.s_gid;
inode->i_blocks = inode->i_blksize = 0;
+ inode->u.generic_ip = kmem_cache_alloc(isofs_cachep, SLAB_KERNEL);


- inode->u.isofs_i.i_section_size = isonum_733 (de->size);
+ ISOFS_I(inode)->i_section_size = isonum_733 (de->size);
if(de->flags[-high_sierra] & 0x80) {
if(isofs_read_level3_size(inode)) goto fail;
} else {
@@ -1230,7 +1236,7 @@
inode->i_mtime = inode->i_atime = inode->i_ctime =
iso_date(de->date, high_sierra);

- inode->u.isofs_i.i_first_extent = (isonum_733 (de->extent) +
+ ISOFS_I(inode)->i_first_extent = (isonum_733 (de->extent) +
isonum_711 (de->ext_attr_length));

/*
@@ -1298,6 +1304,16 @@
goto out;
}

+
+static void isofs_clear_inode(struct inode *inode)
+{
+ struct iso_inode_info *isofsi = ISOFS_I(inode);
+ inode->u.generic_ip = NULL;
+ if(isofsi)
+ kmem_cache_free(isofs_cachep, isofsi);
+}
+
+
#ifdef LEAK_CHECK
#undef malloc
#undef free_s
@@ -1332,12 +1348,21 @@

static int __init init_iso9660_fs(void)
{
+ isofs_cachep = kmem_cache_create("isofs_inodes",
+ sizeof(struct iso_inode_info),
+ 0, SLAB_HWCACHE_ALIGN,
+ NULL, NULL);
+ if(isofs_cachep == NULL)
+ return -ENOMEM;
+
return register_filesystem(&iso9660_fs_type);
}

static void __exit exit_iso9660_fs(void)
{
unregister_filesystem(&iso9660_fs_type);
+ if(kmem_cache_destroy(isofs_cachep))
+ printk(KERN_INFO "isofs_inodes: not all structures were freed\n");
}

EXPORT_NO_SYMBOLS;
Index: fs/isofs/namei.c
===================================================================
RCS file: /home/erik/cvsroot/elinux/fs/isofs/namei.c,v
retrieving revision 1.1.1.10
diff -d -u -r1.1.1.10 namei.c
--- fs/isofs/namei.c 2001/02/21 14:46:02 1.1.1.10
+++ fs/isofs/namei.c 2001/04/24 13:13:29
@@ -8,6 +8,7 @@

#include <linux/sched.h>
#include <linux/iso_fs.h>
+#include <linux/iso_fs_i.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/stat.h>
@@ -65,7 +66,7 @@
unsigned int block, f_pos, offset;
struct buffer_head * bh = NULL;

- if (!dir->u.isofs_i.i_first_extent)
+ if (!ISOFS_I(dir)->i_first_extent)
return 0;

f_pos = 0;
Index: fs/isofs/rock.c
===================================================================
RCS file: /home/erik/cvsroot/elinux/fs/isofs/rock.c,v
retrieving revision 1.1.1.53
diff -d -u -r1.1.1.53 rock.c
--- fs/isofs/rock.c 2001/04/21 21:24:00 1.1.1.53
+++ fs/isofs/rock.c 2001/04/24 13:13:29
@@ -9,6 +9,7 @@
#include <linux/stat.h>
#include <linux/sched.h>
#include <linux/iso_fs.h>
+#include <linux/iso_fs_i.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/slab.h>
@@ -349,9 +350,9 @@
printk(KERN_WARNING "Attempt to read inode for relocated directory\n");
goto out;
case SIG('C','L'):
- inode->u.isofs_i.i_first_extent = isonum_733(rr->u.CL.location);
+ ISOFS_I(inode)->i_first_extent = isonum_733(rr->u.CL.location);
reloc = iget(inode->i_sb,
- (inode->u.isofs_i.i_first_extent <<
+ (ISOFS_I(inode)->i_first_extent <<
inode -> i_sb -> u.isofs_sb.s_log_zone_size));
if (!reloc)
goto out;
Index: fs/proc/base.c
===================================================================
RCS file: /home/erik/cvsroot/elinux/fs/proc/base.c,v
retrieving revision 1.1.1.7
diff -d -u -r1.1.1.7 base.c
--- fs/proc/base.c 2001/04/08 21:51:17 1.1.1.7
+++ fs/proc/base.c 2001/04/24 13:13:29
@@ -19,6 +19,7 @@
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/proc_fs.h>
+#include <linux/proc_fs_i.h>
#include <linux/stat.h>
#include <linux/init.h>
#include <linux/file.h>
@@ -42,9 +43,9 @@

static int proc_fd_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
{
- if (inode->u.proc_i.file) {
- *mnt = mntget(inode->u.proc_i.file->f_vfsmnt);
- *dentry = dget(inode->u.proc_i.file->f_dentry);
+ if (PROCFS_I(inode)->file) {
+ *mnt = mntget(PROCFS_I(inode)->file->f_vfsmnt);
+ *dentry = dget(PROCFS_I(inode)->file->f_dentry);
return 0;
}
return -ENOENT;
@@ -55,7 +56,7 @@
struct mm_struct * mm;
struct vm_area_struct * vma;
int result = -ENOENT;
- struct task_struct *task = inode->u.proc_i.task;
+ struct task_struct *task = PROCFS_I(inode)->task;

task_lock(task);
mm = task->mm;
@@ -86,11 +87,11 @@
{
struct fs_struct *fs;
int result = -ENOENT;
- task_lock(inode->u.proc_i.task);
- fs = inode->u.proc_i.task->fs;
+ task_lock(PROCFS_I(inode)->task);
+ fs = PROCFS_I(inode)->task->fs;
if(fs)
atomic_inc(&fs->count);
- task_unlock(inode->u.proc_i.task);
+ task_unlock(PROCFS_I(inode)->task);
if (fs) {
read_lock(&fs->lock);
*mnt = mntget(fs->pwdmnt);
@@ -106,11 +107,11 @@
{
struct fs_struct *fs;
int result = -ENOENT;
- task_lock(inode->u.proc_i.task);
- fs = inode->u.proc_i.task->fs;
+ task_lock(PROCFS_I(inode)->task);
+ fs = PROCFS_I(inode)->task->fs;
if(fs)
atomic_inc(&fs->count);
- task_unlock(inode->u.proc_i.task);
+ task_unlock(PROCFS_I(inode)->task);
if (fs) {
read_lock(&fs->lock);
*mnt = mntget(fs->rootmnt);
@@ -258,7 +259,7 @@
size_t count, loff_t *ppos)
{
struct inode * inode = file->f_dentry->d_inode;
- struct task_struct *task = inode->u.proc_i.task;
+ struct task_struct *task = PROCFS_I(inode)->task;
ssize_t res;

res = proc_pid_read_maps(task, file, buf, count, ppos);
@@ -278,14 +279,14 @@
unsigned long page;
ssize_t length;
ssize_t end;
- struct task_struct *task = inode->u.proc_i.task;
+ struct task_struct *task = PROCFS_I(inode)->task;

if (count > PROC_BLOCK_SIZE)
count = PROC_BLOCK_SIZE;
if (!(page = __get_free_page(GFP_KERNEL)))
return -ENOMEM;

- length = inode->u.proc_i.op.proc_read(task, (char*)page);
+ length = PROCFS_I(inode)->op.proc_read(task, (char*)page);

if (length < 0) {
free_page(page);
@@ -315,7 +316,7 @@
static ssize_t mem_read(struct file * file, char * buf,
size_t count, loff_t *ppos)
{
- struct task_struct *task = file->f_dentry->d_inode->u.proc_i.task;
+ struct task_struct *task = PROCFS_I(file->f_dentry->d_inode)->task;
char *page;
unsigned long src = *ppos;
int copied = 0;
@@ -360,7 +361,7 @@
{
int copied = 0;
char *page;
- struct task_struct *task = file->f_dentry->d_inode->u.proc_i.task;
+ struct task_struct *task = PROCFS_I(file->f_dentry->d_inode)->task;
unsigned long dst = *ppos;

if (!MAY_PTRACE(task))
@@ -418,7 +419,7 @@
if (error)
goto out;

- error = inode->u.proc_i.op.proc_get_link(inode, &nd->dentry, &nd->mnt);
+ error = PROCFS_I(inode)->op.proc_get_link(inode, &nd->dentry, &nd->mnt);
nd->last_type = LAST_BIND;
out:
return error;
@@ -458,7 +459,7 @@
if (error)
goto out;

- error = inode->u.proc_i.op.proc_get_link(inode, &de, &mnt);
+ error = PROCFS_I(inode)->op.proc_get_link(inode, &de, &mnt);
if (error)
goto out;

@@ -523,7 +524,7 @@
static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir)
{
struct inode *inode = filp->f_dentry->d_inode;
- struct task_struct *p = inode->u.proc_i.task;
+ struct task_struct *p = PROCFS_I(inode)->task;
unsigned int fd, pid, ino;
int retval;
char buf[NUMBUF];
@@ -585,8 +586,8 @@
struct inode *inode = filp->f_dentry->d_inode;
struct pid_entry *p;

- pid = inode->u.proc_i.task->pid;
- if (!inode->u.proc_i.task->p_pptr)
+ pid = PROCFS_I(inode)->task->pid;
+ if (!PROCFS_I(inode)->task->p_pptr)
return -ENOENT;
i = filp->f_pos;
switch (i) {
@@ -632,14 +633,16 @@

/* Common stuff */

+
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
inode->i_ino = fake_ino(task->pid, ino);
+ inode->u.generic_ip = kmem_cache_alloc(proc_cachep, SLAB_KERNEL);

- inode->u.proc_i.file = NULL;
+ PROCFS_I(inode)->file = NULL;
/*
* grab the reference to task.
*/
- inode->u.proc_i.task = task;
+ PROCFS_I(inode)->task = task;
get_task_struct(task);
if (!task->p_pptr)
goto out_unlock;
@@ -673,7 +676,7 @@
*/
static int pid_base_revalidate(struct dentry * dentry, int flags)
{
- if (dentry->d_inode->u.proc_i.task->p_pptr)
+ if (PROCFS_I(dentry->d_inode)->task->p_pptr)
return 1;
d_drop(dentry);
return 0;
@@ -707,7 +710,7 @@
static struct dentry *proc_lookupfd(struct inode * dir, struct dentry * dentry)
{
unsigned int fd, c;
- struct task_struct *task = dir->u.proc_i.task;
+ struct task_struct *task = PROCFS_I(dir)->task;
struct file * file;
struct files_struct * files;
struct inode *inode;
@@ -740,7 +743,7 @@
if (!files)
goto out_unlock;
read_lock(&files->file_lock);
- file = inode->u.proc_i.file = fcheck_files(files, fd);
+ file = PROCFS_I(inode)->file = fcheck_files(files, fd);
if (!file)
goto out_unlock2;
get_file(file);
@@ -749,7 +752,7 @@
inode->i_op = &proc_pid_link_inode_operations;
inode->i_size = 64;
inode->i_mode = S_IFLNK;
- inode->u.proc_i.op.proc_get_link = proc_fd_link;
+ PROCFS_I(inode)->op.proc_get_link = proc_fd_link;
if (file->f_mode & 1)
inode->i_mode |= S_IRUSR | S_IXUSR;
if (file->f_mode & 2)
@@ -784,7 +787,7 @@
{
struct inode *inode;
int error;
- struct task_struct *task = dir->u.proc_i.task;
+ struct task_struct *task = PROCFS_I(dir)->task;
struct pid_entry *p;

error = -ENOENT;
@@ -817,35 +820,35 @@
break;
case PROC_PID_EXE:
inode->i_op = &proc_pid_link_inode_operations;
- inode->u.proc_i.op.proc_get_link = proc_exe_link;
+ PROCFS_I(inode)->op.proc_get_link = proc_exe_link;
break;
case PROC_PID_CWD:
inode->i_op = &proc_pid_link_inode_operations;
- inode->u.proc_i.op.proc_get_link = proc_cwd_link;
+ PROCFS_I(inode)->op.proc_get_link = proc_cwd_link;
break;
case PROC_PID_ROOT:
inode->i_op = &proc_pid_link_inode_operations;
- inode->u.proc_i.op.proc_get_link = proc_root_link;
+ PROCFS_I(inode)->op.proc_get_link = proc_root_link;
break;
case PROC_PID_ENVIRON:
inode->i_fop = &proc_info_file_operations;
- inode->u.proc_i.op.proc_read = proc_pid_environ;
+ PROCFS_I(inode)->op.proc_read = proc_pid_environ;
break;
case PROC_PID_STATUS:
inode->i_fop = &proc_info_file_operations;
- inode->u.proc_i.op.proc_read = proc_pid_status;
+ PROCFS_I(inode)->op.proc_read = proc_pid_status;
break;
case PROC_PID_STAT:
inode->i_fop = &proc_info_file_operations;
- inode->u.proc_i.op.proc_read = proc_pid_stat;
+ PROCFS_I(inode)->op.proc_read = proc_pid_stat;
break;
case PROC_PID_CMDLINE:
inode->i_fop = &proc_info_file_operations;
- inode->u.proc_i.op.proc_read = proc_pid_cmdline;
+ PROCFS_I(inode)->op.proc_read = proc_pid_cmdline;
break;
case PROC_PID_STATM:
inode->i_fop = &proc_info_file_operations;
- inode->u.proc_i.op.proc_read = proc_pid_statm;
+ PROCFS_I(inode)->op.proc_read = proc_pid_statm;
break;
case PROC_PID_MAPS:
inode->i_fop = &proc_maps_operations;
@@ -853,7 +856,7 @@
#ifdef CONFIG_SMP
case PROC_PID_CPU:
inode->i_fop = &proc_info_file_operations;
- inode->u.proc_i.op.proc_read = proc_pid_cpu;
+ PROCFS_I(inode)->op.proc_read = proc_pid_cpu;
break;
#endif
case PROC_PID_MEM:
@@ -920,9 +923,10 @@
if (!inode)
return ERR_PTR(-ENOMEM);
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+ inode->u.generic_ip = kmem_cache_alloc(proc_cachep, SLAB_KERNEL);
inode->i_ino = fake_ino(0, PROC_PID_INO);
- inode->u.proc_i.file = NULL;
- inode->u.proc_i.task = NULL;
+ PROCFS_I(inode)->file = NULL;
+ PROCFS_I(inode)->task = NULL;
inode->i_mode = S_IFLNK|S_IRWXUGO;
inode->i_uid = inode->i_gid = 0;
inode->i_size = 64;
@@ -972,10 +976,10 @@

void proc_pid_delete_inode(struct inode *inode)
{
- if (inode->u.proc_i.file)
- fput(inode->u.proc_i.file);
- if (inode->u.proc_i.task)
- free_task_struct(inode->u.proc_i.task);
+ if (PROCFS_I(inode)->file)
+ fput(PROCFS_I(inode)->file);
+ if (PROCFS_I(inode)->task)
+ free_task_struct(PROCFS_I(inode)->task);
}

#define PROC_NUMBUF 10
Index: fs/proc/generic.c
===================================================================
RCS file: /home/erik/cvsroot/elinux/fs/proc/generic.c,v
retrieving revision 1.1.1.24
diff -d -u -r1.1.1.24 generic.c
--- fs/proc/generic.c 2001/04/21 21:23:54 1.1.1.24
+++ fs/proc/generic.c 2001/04/24 13:13:29
@@ -445,6 +445,9 @@
const char *fn = name;
int len;

+ if (! (S_ISCHR(mode) || S_ISBLK(mode)))
+ BUG();
+
if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
goto out;
len = strlen(fn);
Index: fs/proc/inode.c
===================================================================
RCS file: /home/erik/cvsroot/elinux/fs/proc/inode.c,v
retrieving revision 1.1.1.7
diff -d -u -r1.1.1.7 inode.c
--- fs/proc/inode.c 2001/04/21 15:05:27 1.1.1.7
+++ fs/proc/inode.c 2001/04/24 13:13:29
@@ -80,6 +80,14 @@
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
}

+static void proc_clear_inode(struct inode *inode)
+{
+ struct proc_inode_info *proci = PROCFS_I(inode);
+ inode->u.generic_ip = NULL;
+ if (proci)
+ kmem_cache_free(proc_cachep, proci);
+}
+
static int proc_statfs(struct super_block *sb, struct statfs *buf)
{
buf->f_type = PROC_SUPER_MAGIC;
@@ -93,6 +101,7 @@

static struct super_operations proc_sops = {
read_inode: proc_read_inode,
+ clear_inode: proc_clear_inode,
put_inode: force_delete,
delete_inode: proc_delete_inode,
statfs: proc_statfs,
Index: fs/proc/procfs_syms.c
===================================================================
RCS file: /home/erik/cvsroot/elinux/fs/proc/procfs_syms.c,v
retrieving revision 1.1.1.12
diff -d -u -r1.1.1.12 procfs_syms.c
--- fs/proc/procfs_syms.c 2001/04/21 21:23:54 1.1.1.12
+++ fs/proc/procfs_syms.c 2001/04/24 13:13:29
@@ -2,6 +2,7 @@
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/proc_fs.h>
+#include <linux/proc_fs_i.h>
#include <linux/init.h>

extern struct proc_dir_entry *proc_sys_root;
@@ -21,18 +22,37 @@
EXPORT_SYMBOL(proc_root_driver);

static DECLARE_FSTYPE(proc_fs_type, "proc", proc_read_super, FS_SINGLE);
+kmem_cache_t *proc_cachep;

static int __init init_proc_fs(void)
{
- int err = register_filesystem(&proc_fs_type);
- if (!err) {
- proc_mnt = kern_mount(&proc_fs_type);
- err = PTR_ERR(proc_mnt);
- if (IS_ERR(proc_mnt))
- unregister_filesystem(&proc_fs_type);
- else
- err = 0;
+ int err;
+
+ proc_cachep = kmem_cache_create("proc_inodes",
+ sizeof(struct proc_inode_info),
+ 0, SLAB_HWCACHE_ALIGN,
+ NULL, NULL);
+
+ if (proc_cachep == NULL)
+ return -ENOMEM;
+
+ err = register_filesystem(&proc_fs_type);
+ if(err)
+ goto badproc;
+
+ proc_mnt = kern_mount(&proc_fs_type);
+ err = PTR_ERR(proc_mnt);
+ if (IS_ERR(proc_mnt)) {
+ unregister_filesystem(&proc_fs_type);
+ goto badproc;
}
+
+ return 0;
+
+badproc:
+ if (kmem_cache_destroy(proc_cachep))
+ printk(KERN_INFO "proc_inodes: not all structures were freed\n");
+
return err;
}

@@ -40,6 +60,8 @@
{
unregister_filesystem(&proc_fs_type);
kern_umount(proc_mnt);
+ if (kmem_cache_destroy(proc_cachep))
+ printk(KERN_INFO "proc_inodes: not all structures were freed\n");
}

module_init(init_proc_fs)
Index: include/linux/fs.h
===================================================================
RCS file: /home/erik/cvsroot/elinux/include/linux/fs.h,v
retrieving revision 1.1.1.44
diff -d -u -r1.1.1.44 fs.h
--- include/linux/fs.h 2001/04/21 21:37:02 1.1.1.44
+++ include/linux/fs.h 2001/04/24 13:13:29
@@ -283,7 +283,6 @@
#include <linux/ntfs_fs_i.h>
#include <linux/msdos_fs_i.h>
#include <linux/umsdos_fs_i.h>
-#include <linux/iso_fs_i.h>
#include <linux/nfs_fs_i.h>
#include <linux/sysv_fs_i.h>
#include <linux/affs_fs_i.h>
@@ -300,7 +299,6 @@
#include <linux/bfs_fs_i.h>
#include <linux/udf_fs_i.h>
#include <linux/ncp_fs_i.h>
-#include <linux/proc_fs_i.h>
#include <linux/usbdev_fs_i.h>

/*
@@ -447,7 +445,6 @@
struct ntfs_inode_info ntfs_i;
struct msdos_inode_info msdos_i;
struct umsdos_inode_info umsdos_i;
- struct iso_inode_info isofs_i;
struct nfs_inode_info nfs_i;
struct sysv_inode_info sysv_i;
struct affs_inode_info affs_i;
@@ -464,7 +461,6 @@
struct bfs_inode_info bfs_i;
struct udf_inode_info udf_i;
struct ncp_inode_info ncpfs_i;
- struct proc_inode_info proc_i;
struct socket socket_i;
struct usbdev_inode_info usbdev_i;
void *generic_ip;
Index: include/linux/iso_fs.h
===================================================================
RCS file: /home/erik/cvsroot/elinux/include/linux/iso_fs.h,v
retrieving revision 1.1.1.9
diff -d -u -r1.1.1.9 iso_fs.h
--- include/linux/iso_fs.h 2001/01/02 15:20:04 1.1.1.9
+++ include/linux/iso_fs.h 2001/04/24 13:13:29
@@ -203,6 +203,11 @@
extern void leak_check_brelse(struct buffer_head * bh);
#endif /* LEAK_CHECK */

+static inline struct iso_inode_info *ISOFS_I(struct inode *inode)
+{
+ return (struct iso_inode_info *)inode->u.generic_ip;
+}
+
#endif /* __KERNEL__ */

#endif
Index: include/linux/proc_fs.h
===================================================================
RCS file: /home/erik/cvsroot/elinux/include/linux/proc_fs.h,v
retrieving revision 1.1.1.15
diff -d -u -r1.1.1.15 proc_fs.h
--- include/linux/proc_fs.h 2001/04/21 21:37:36 1.1.1.15
+++ include/linux/proc_fs.h 2001/04/24 13:13:29
@@ -83,6 +83,8 @@
extern struct proc_dir_entry *proc_root_driver;
extern struct proc_dir_entry *proc_root_kcore;

+extern kmem_cache_t *proc_cachep;
+
extern void proc_root_init(void);
extern void proc_misc_init(void);

@@ -162,6 +164,11 @@
static inline void proc_net_remove(const char *name)
{
remove_proc_entry(name,proc_net);
+}
+
+static inline struct proc_inode_info *PROCFS_I(struct inode * inode)
+{
+ return (struct proc_inode_info *)inode->u.generic_ip;
}

#else


--
J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
of Electrical Engineering, Faculty of Information Technology and Systems,
Delft University of Technology, PO BOX 5031, 2600 GA Delft, The Netherlands
Phone: +31-15-2783635 Fax: +31-15-2781843 Email: [email protected]
WWW: http://www-ict.its.tudelft.nl/~erik/

2001-04-24 13:36:19

by Christoph Rohland

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Hi Al,

On Tue, 24 Apr 2001, Alexander Viro wrote:
> So yes, IMO having such patches available _is_ a good thing. And in
> 2.5 we definitely want them in the tree. If encapsulation part gets
> there during 2.4 and separate allocation is available for all of
> them it will be easier to do without PITA in process.

OK I will do that for tmpfs soon. And I will do the symlink inlining
with that patch.

Greetings
Christoph


2001-04-24 16:06:54

by Masaki Tsuji

[permalink] [raw]
Subject: Can't read SCSI TAPE

Dear sirs,

Although 'tar' can write to the SCSI tape, it can't read from it.
'tar' reports ....

......
-rw-r--r-- root/root xxxxx 2001-xx-xx 01:23 usr/bin/xxxxxx
tar: Skipping to next file header <------"A"
-rw-r--r-- root/root xxxxx 2001-xx-xx 01:23 usr/bin/xxxxxxx
......


"A" means written data is wrong, doesn't it???


Thanks for any help.

------------------------------------------
Detailed ->

System...
Kernel : 2.2.11 + raid0145-19990824-2.2.11.gz
or
2.2.11
tar : GNU tar 1.12
mt : mt-st v. 0.4
glibc2 : glibc-2.0.7pre6

Hardware...
Mother : Intel Celeron x2 (SMP)
TAPE drv : SONY SDT-9000
TAPE : DDS1 DDS2 DDS3
SCSI card: AHA-1542
Cable : SCSI-2 Hi-impedance, length 0.5m
------------------------------------------

--
Masaki Tsuji

2001-04-24 16:40:49

by David L. Parsley

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Christoph Rohland wrote:
>
> OK I will do that for tmpfs soon. And I will do the symlink inlining
> with that patch.

Wow, this thread really exploded, eh? But thanks, Christoph, I look
forward to seeing your patch. 4k symlinks really suck for embedders
who never swap out pages. ;-)

regards,
David

--
David L. Parsley
Network Administrator
Roanoke College

2001-04-24 18:39:30

by Andreas Dilger

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Al writes:
> Encapsulation part is definitely worth doing - it cleans the code up
> and doesn't change the result of compile. Adding allocation/freeing/
> cache initialization/cache removal and changing FOOFS_I() definition -
> well, it's probably worth to keep such patches around, but whether
> to switch any individual filesystem during 2.4 is a policy decision.
> Up to maintainer, indeed. Notice that these patches (separate allocation
> per se) are going to be within 3-4Kb per filesystem _and_ completely
> straightforward.

One thing to watch out for is that the current code zeros the u. struct
for us (as you pointed out to me previously), but allocating from the
slab cache will not... This could be an interesting source of bugs for
some filesystems that assume zero'd inode_info structs.
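
One simple way to keep the old zeroed-union behaviour when converting a
filesystem is to clear the private struct right after allocating it; a
fragment of the idea (the foofs names are illustrative, not from any
actual patch):

	inode->u.generic_ip = kmem_cache_alloc(foofs_cachep, SLAB_KERNEL);
	if (inode->u.generic_ip)
		memset(inode->u.generic_ip, 0,
		       sizeof(struct foofs_inode_info));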

Fortunately, it doesn't appear that my patch to clean out all of the
"duplicate" zero initializers in the fs-specific code was accepted...

> What I would like to avoid is scenario like:
>
> Maintainers of filesystems with large private inodes: Why would we separate
> them? We would only waste memory, since the other filesystems stay in ->u
> and keep it large.

Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
approximate for 32-bit arches - I was just counting by hand and not
strictly checking alignment), then almost all other filesystems are below
25 * __u32 (i.e. half of the previous size).

For large-private-inode filesystems, we are wasting memory in EVERY inode
in the slab cache, not just ones in use with the large private inode. If
it were the most common filesystem (ext2, maybe reiser, msdos next) then
it wouldn't make much difference.

At some point, once the union is small enough, separate slab allocations
are no longer a win from a memory usage standpoint.

The remaining info structs are (approx. for 32-bit arch) (size in __u32):

ext2 27
affs 26
ufs 25
socket 24
shmem 22
coda 20
qnx4 18

minix 16
umsdos 15
hpfs 15
efs 14
sysv 13
reiser 12
udf 12
ntfs 11
ncp 10
msdos 9
adfs 7
smb 6
usbdev 5
proc 4
iso 4
bfs 3
romfs 2


> Maintainers of the rest of filesystems: Since there's no patches that would
> take large stuff out of ->u, why would we bother?

Maybe the size of the union can depend on CONFIG_*_FS? There should be
an absolute minimum size (16 * __u32 or so), but then people who want
reiserfs as their primary fs do not need to pay the memory penalty of ext2.
For ext2 (the next largest and most common fs), we could make it part of
the union if it is compiled in, and on a slab cache if it is a module?

Should uncommon-but-widely-used things like socket and shmem have their
own slab cache, or should they just allocate from the generic size-32 slab?

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-24 18:49:41

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Tue, 24 Apr 2001, Andreas Dilger wrote:

> One thing to watch out for is that the current code zeros the u. struct
> for us (as you pointed out to me previously), but allocating from the
> slab cache will not... This could be an interesting source of bugs for
> some filesystems that assume zero'd inode_info structs.

True, but easy to catch.

> Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
> approximate for 32-bit arches - I was just counting by hand and not
> strictly checking alignment), then almost all other filesystems are below
> 25 * __u32 (i.e. half of the previous size).

Yeah, but NFS suddenly takes 25+50 words... That's the type of complaints
I'm thinking about.

> Maybe the size of the union can depend on CONFIG_*_FS? There should be
> an absolute minimum size (16 * __u32 or so), but then people who want
> reiserfs as their primary fs do not need to pay the memory penalty of ext2.
> For ext2 (the next largest and most common fs), we could make it part of
> the union if it is compiled in, and on a slab cache if it is a module?

NO. Sorry about shouting, but that's the way to madness. I can understand
code depending on SMP vs. UP and similar beasts, but presence of specific
filesystems.... <shudder>

> Should uncommon-but-widely-used things like socket and shmem have their
> own slab cache, or should they just allocate from the generic size-32 slab?

That's pretty interesting - especially for sockets. I wonder whether
we would get problems with separate allocation of these - we don't
go from inode to socket all that often, but...

2001-04-24 18:49:50

by Andreas Dilger

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Erik Mouw writes:
> Al is right, it is no rocket science. Here is a patch against
> 2.4.4-pre6 for procfs and isofs. It took me an hour to do because I'm
> not familiar with the fs code. It compiles, and the procfs code even
> runs (sorry, no CDROM player available on my embedded StrongARM
> system), though it is possible that there are some bugs in it.

While I applaud your initiative, you made an unfortunate choice of
filesystems to convert. The iso_inode_info is only 4*__u32, as is
proc_inode_info. Given that we still need to keep a pointer to the
external info structs, and the overhead of the slab cache itself
(both CPU usage and memory overhead, however small), I don't think
it is worthwhile to have isofs and procfs in separate slabs.

On the other hand, sockets and shmem are both relatively large...
Watch out that the *_inode_info structs have all of the fields
initialized, because the union field is zeroed for us, but slab is not.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-24 18:53:01

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Tue, 24 Apr 2001, Andreas Dilger wrote:

> While I applaud your initiative, you made an unfortunate choice of
> filesystems to convert. The iso_inode_info is only 4*__u32, as is
> proc_inode_info. Given that we still need to keep a pointer to the
> external info structs, and the overhead of the slab cache itself
> (both CPU usage and memory overhead, however small), I don't think
> it is worthwhile to have isofs and procfs in separate slabs.
>
> On the other hand, sockets and shmem are both relatively large...
> Watch out that the *_inode_info structs have all of the fields
> initialized, because the union field is zeroed for us, but slab is not.

Frankly, I'd rather start with encapsulation part. It's easy to
verify, it can go in right now and it makes separate allocation
part uncluttered. Besides, it simply makes code cleaner, so it
makes sense even if don't want to go for separate allocation for
that particular fs.

2001-04-24 19:13:58

by Andreas Dilger

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Al writes:
> > Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
> > approximate for 32-bit arches - I was just counting by hand and not
> > strictly checking alignment), then almost all other filesystems are below
> > 25 * __u32 (i.e. half of the previous size).
>
> Yeah, but NFS suddenly takes 25+50 words... That's the type of complaints
> I'm thinking about.

But then again, you are saving 50-25 words for every non-NFS inode, and I
think _most_ systems will have more local inodes than NFS inodes. Even
NFS servers will have local inodes, only clients (AFAIK) use nfs_inode_info.

> > Maybe the size of the union can depend on CONFIG_*_FS? There should be
> > an absolute minimum size (16 * __u32 or so), but then people who want
> > reiserfs as their primary fs do not need to pay the memory penalty of ext2.
> > For ext2 (the next largest and most common fs), we could make it part of
> > the union if it is compiled in, and on a slab cache if it is a module?
>
> NO. Sorry about shouting, but that's the way to madness. I can understand
> code depending on SMP vs. UP and similar beasts, but presense of specific
> filesystems.... <shudder>

But then again, if the size of nfs_inode_info changes, it is the same
problem... sizeof(struct inode) may have changed (depends if slab has
some padding between inodes or not). If we stick to a minimum size
(16 words or maybe even 8), then it will never change anymore, and we
do not have overhead for small inode_info structs.

> > Should uncommon-but-widely-used things like socket and shmem have their
> > own slab cache, or should they just allocate from the generic size-32 slab?
>
> That's pretty interesting - especially for sockets. I wonder whether
> we would get problems with separate allocation of these - we don't
> go from inode to socket all that often, but...

I never thought of that. I guess the socket code does not know which
fs the inode_info was allocated from, so it cannot free it from the slab
(even if it had access to these slabs, which it does not). In that case,
each fs would have struct socket as the minimum size allocatable, which
is unfortunately one of the largest inode_info sizes. It is smaller
than ext2, but...

Any ideas? Do we ever get back into fs-specific clear_inode() from
a socket? In that case, the socket would just hold a pointer to the
fs-specific inode_info inside its own struct socket until the inode
is dropped.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-24 20:46:33

by Erik Mouw

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

On Tue, Apr 24, 2001 at 12:47:38PM -0600, Andreas Dilger wrote:
> While I applaud your initiative, you made an unfortunate choice of
> filesystems to convert. The iso_inode_info is only 4*__u32, as is
> proc_inode_info. Given that we still need to keep a pointer to the
> external info structs, and the overhead of the slab cache itself
> (both CPU usage and memory overhead, however small), I don't think
> it is worthwhile to have isofs and procfs in separate slabs.

Well, I know a little bit about procfs because I'm currently
documenting it, so that's why I picked it first. After I got the idea,
isofs was quite easy.

In retrospect it would have been more effective to pick a filesystem
with a larger *_inode_info field, but then again: Al is right. Struct
inode is cluttered with *_inode_info fields, while we use anonymous
data entries in other parts of the kernel (like the data pointer in
struct proc_dir_entry, or the priv pointer in struct net_device).

There is another advantage: suppose you're hacking on a filesystem and
change its *_fs_i.h header. With Al's proposal you only have to
recompile the filesystem you're hacking on, while you have to recompile
the complete kernel in the current situation.
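
For comparison, the anonymous-pointer idiom Erik refers to looks like
this elsewhere in the kernel (fragment; my_state is just an illustrative
variable):

	struct proc_dir_entry *de = create_proc_entry("foo", 0444, NULL);
	if (de)
		de->data = my_state;	/* procfs carries an opaque void *data */

	dev->priv = my_state;		/* struct net_device carries a void *priv */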


Erik

--
J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
of Electrical Engineering, Faculty of Information Technology and Systems,
Delft University of Technology, PO BOX 5031, 2600 GA Delft, The Netherlands
Phone: +31-15-2783635 Fax: +31-15-2781843 Email: [email protected]
WWW: http://www-ict.its.tudelft.nl/~erik/

2001-04-24 22:02:15

by Ingo Oeser

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

On Tue, Apr 24, 2001 at 02:49:23PM -0400, Alexander Viro wrote:
> On Tue, 24 Apr 2001, Andreas Dilger wrote:
> > One thing to watch out for is that the current code zeros the u. struct
> > for us (as you pointed out to me previously), but allocating from the
> > slab cache will not... This could be an interesting source of bugs for
> > some filesystems that assume zero'd inode_info structs.
> True, but easy to catch.

Jepp. Just request SLAB_ZERO (still to be implemented) instead of
SLAB_POISON, or provide a constructor.

A nice set of macros for this would make it quite easy. The ctor
is the way to handle it. Maybe we could even put all the fs-specific
initializers into it (e.g. magics, zeroes).
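
A minimal sketch of the constructor idea, assuming the 2.4-era
kmem_cache_create()/ctor signatures (the foofs names are illustrative):

	static void foofs_inode_ctor(void *obj, kmem_cache_t *cachep,
				     unsigned long flags)
	{
		struct foofs_inode_info *fi = obj;

		/* runs when a fresh slab page is populated with objects */
		memset(fi, 0, sizeof(*fi));
		/* fs-specific initializers (magics, list heads, ...) go here */
	}

	foofs_cachep = kmem_cache_create("foofs_inodes",
					 sizeof(struct foofs_inode_info),
					 0, SLAB_HWCACHE_ALIGN,
					 foofs_inode_ctor, NULL);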

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-24 22:00:23

by Trond Myklebust

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

>>>>> " " == Alexander Viro <[email protected]> writes:

> On Mon, 23 Apr 2001, Jan Harkes wrote:

>> On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:

>> > BTW: Is it still less than one page? Then it doesn't make me
>> > nervous. Why? Guess what granularity we allocate at, if we
>> > just store pointers instead of the inode.u. Or do you like
>> > every FS creating his own slab cache?

> Oh, for crying out loud. All it takes is half an hour per
> filesystem. Here - completely untested patch that does it for
> NFS. Took about that long. Absolutely straightforward, very
> easy to verify correctness.

> Some stuff may need tweaking, but not much (e.g. some functions
> should take nfs_inode_info instead of inodes, etc.). From the
> look of flushd cache it seems that we would be better off with
> cyclic lists instead of single-linked ones for the hash, but I
> didn't look deep enough.

> So consider the patch below as proof-of-concept. Enjoy:

Hi Al,

I believe your patch introduces a race for the NFS case. The problem
lies in the fact that nfs_find_actor() needs to read several of the
fields from nfs_inode_info. By adding an allocation after the inode
has been hashed, you are creating a window during which the inode can
be found by find_inode(), but during which you aren't even guaranteed
that the nfs_inode_info exists let alone that it's been initialized
by nfs_fill_inode().

One solution could be to have find_inode sleep on encountering a
locked inode. It would have to be something along the lines of

static struct inode * find_inode(struct super_block * sb, unsigned long ino, struct list_head *head, find_inode_t find_actor, void *opaque)
{
	struct list_head *tmp;
	struct inode * inode;

	tmp = head;
	for (;;) {
		tmp = tmp->next;
		inode = NULL;
		if (tmp == head)
			break;
		inode = list_entry(tmp, struct inode, i_hash);
		if (inode->i_ino != ino)
			continue;
		if (inode->i_sb != sb)
			continue;
		if (find_actor) {
			if (inode->i_state & I_LOCK) {
				spin_unlock(&inode_lock);
				__wait_on_inode(inode);
				spin_lock(&inode_lock);
				tmp = head;
				continue;
			}
			if (!find_actor(inode, ino, opaque))
				continue;
		}
		break;
	}
	return inode;
}


Cheers,
Trond

2001-04-24 22:10:05

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On 24 Apr 2001, Trond Myklebust wrote:

> Hi Al,
>
> I believe your patch introduces a race for the NFS case. The problem
> lies in the fact that nfs_find_actor() needs to read several of the
> fields from nfs_inode_info. By adding an allocation after the inode
> has been hashed, you are creating a window during which the inode can
> be found by find_inode(), but during which you aren't even guaranteed
> that the nfs_inode_info exists let alone that it's been initialized
> by nfs_fill_inode().

_Ouch_. So what are you going to do if another iget4() comes between
the moment when you hash the inode and set these fields? You are
filling them only after you drop inode_lock, so AFAICS the current
code has the same problem.

2001-04-24 22:32:28

by Trond Myklebust

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

>>>>> " " == Alexander Viro <[email protected]> writes:

> _Ouch_. So what are you going to do if another iget4() comes
> between the moment when you hash the inode and set these
> fields? You are filling them only after you drop inode_lock, so
> AFAICS the current code has the same problem.

The entire call to iget4() is protected by the BKL in all relevant
instances. As long as we don't sleep between find_inode() and
nfs_fill_inode(), we're safe.

In fact the BKL protection is needed also for another reason: we don't
actually initialize the inode in the I_LOCK-protected read_inode() but
instead rely on the caller of iget4 to do it for us. The reason is
that we would need to pass the struct nfs_fattr to read_inode(),
and this wasn't possible until the ReiserFS people introduced
read_inode2().

Cheers,
Trond

2001-04-25 07:29:39

by Christoph Rohland

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Hi Andreas,

On Tue, 24 Apr 2001, Andreas Dilger wrote:
> On the other hand, sockets and shmem are both relatively large...

shmem is only large because the union is large. I introduced the
direct swap array of size SHMEM_NR_DIRECT simply to take advantage of
the union. We can decrease SHMEM_NR_DIRECT very easily. I am thinking
about 1 or 5 which would mean that we allocate an indirect block for
files bigger than 4k or 20k respectively.

The shmem_inode_info would then be 8 or 12 words.
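
Roughly, the struct Christoph describes would end up shaped like this
(field names are illustrative, not the actual 2.4 definition):

	#define SHMEM_NR_DIRECT 1	/* swap entries kept inline in the inode info */

	struct shmem_inode_info {
		spinlock_t	  lock;
		unsigned long	  next_index;
		swp_entry_t	  i_direct[SHMEM_NR_DIRECT]; /* first 4k of the file */
		swp_entry_t	**i_indirect;	/* only allocated for bigger files */
		unsigned long	  swapped;
		struct list_head  list;
	};					/* about 8 words on a 32-bit arch */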

Greetings
Christoph


2001-04-25 18:47:29

by Alexander Viro

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?



On Wed, 25 Apr 2001, Andreas Dilger wrote:

> Al writes:
> > It's not a fscking rocket science - encapsulate accesses to ->u.foofs_i
> > into inlined function, find ->read_inode, find places that do get_empty_inode
>
> OK, I was doing this for the ext3 port I'm working on for 2.4, and ran into
> a snag. In the ext3_inode_info, there is a list_head. However, if this is
> moved into a separate slab struct, it is now impossible to locate the inode
> from the offset in the slab struct. When I was checking the size of each
> inode_info struct, I noticed several others that had list_heads in them.
> One solution is that we store list_heads in the inode proper, after generic_ip.

If you need to go from ext3_inode_info to inode - put the pointer into the
thing and be done with that. No need to bump ->i_count - fs-private
part dies before inode itself.
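
A minimal sketch of that suggestion (names are illustrative): the
list_head stays in the separately allocated ext3_inode_info, which
simply carries a pointer back to its inode:

	struct ext3_inode_info {
		struct list_head  i_orphan;	/* list_head lives here */
		struct inode	 *vfs_inode;	/* set when the info struct
						 * is allocated */
		/* ... remaining ext3-private fields ... */
	};

	/* walking the list, the inode is still one dereference away: */
	struct ext3_inode_info *ei = list_entry(p, struct ext3_inode_info, i_orphan);
	struct inode *inode = ei->vfs_inode;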

2001-04-26 21:49:01

by Andreas Dilger

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?

Al writes:
> It's not a fscking rocket science - encapsulate accesses to ->u.foofs_i
> into inlined function, find ->read_inode, find places that do get_empty_inode

OK, I was doing this for the ext3 port I'm working on for 2.4, and ran into
a snag. In the ext3_inode_info, there is a list_head. However, if this is
moved into a separate slab struct, it is now impossible to locate the inode
from the offset in the slab struct. When I was checking the size of each
inode_info struct, I noticed several others that had list_heads in them.
One solution is that we store list_heads in the inode proper, after generic_ip.

Cheers, Andreas
--
Andreas Dilger TurboLabs filesystem development
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-04-24 10:21:29

by David Woodhouse

[permalink] [raw]
Subject: Re: hundreds of mount --bind mountpoints?


[email protected] said:
> <tone polite> What's stopping you? </tone> You _are_ JFFS maintainer,
> aren't you?

It already uses...

#define JFFS2_INODE_INFO(i) (&i->u.jffs2_i)

It's trivial to switch over when the size of the inode union goes below the
size of struct jffs2_inode_info. Until then, I'd just be wasting space.
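
Illustratively, the switch-over would only repoint the accessor at the
generic pointer once the union shrinks, e.g.:

	/* today: private data lives in the union */
	#define JFFS2_INODE_INFO(i) (&(i)->u.jffs2_i)

	/* after separate allocation: same callers, new definition */
	#define JFFS2_INODE_INFO(i) ((struct jffs2_inode_info *)(i)->u.generic_ip)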

--
dwmw2