2000-12-28 20:56:08

by Linus Torvalds

[permalink] [raw]
Subject: test13-pre5


The main notables are the network fixes (uninitialized skb->dev could and
did cause oopses in ip_defrag) and the mm fixes (dirty pages without
mappings etc, causing problems in page_launder).

The mm cleanups also include removing "swapout()" as a VM operation, as
nobody can sanely do anything more than just marking the page dirty anyway
(the real work is done by writepage() these days), and doing that
explicitly simplifies VM scanning considerably.

This still doesn't tell "sync()" about dirty pages (ie the "innd loses the
active file after a reboot" bug), but now the places that mark pages dirty
are under control. Next step..

Linus

-----

- pre5:
- NIIBE Yutaka: SuperH update
- Geert Uytterhoeven: m68k update
- David Miller: TCP RTO calc fix, UDP multicast fix etc
- Duncan Laurie: ServerWorks PIRQ routing definition.
- mm PageDirty cleanups, added sanity checks, and don't lose the bit.

- pre4:
- Christoph Rohland: shmfs cleanup
- Nicolas Pitre: don't forget loop.c flags
- Geert Uytterhoeven: new-style m68k Makefiles
- Neil Brown: knfsd cleanups, raid5 re-org
- Andrea Arkangeli: update to LVM-0.9
- LC Chang: sis900 driver doc update
- David Miller: netfilter oops fix
- Andrew Grover: acpi update

- pre3:
- Christian Jullien: smc9194: proper dev_kfree_skb_irq
- Cort Dougan: new-style PowerPC Makefiles
- Andrew Morton, Petr Vandrovec: fix run_task_queue
- Christoph Rohland: shmfs for shared memory handling

- pre2:
- Kai Germaschewski: ISDN update (including Makefiles)
- Jens Axboe: cdrom updates
- Petr Vandrovec; Matrox G450 support
- Bill Nottingham: fix FAT32 filesystems on 64-bit platforms
- David Miller: sparc (and other) Makefile fixup
- Andrea Arkangeli: alpha SMP TLB context fix (and cleanups)
- Niels Kristian Bech Jensen: checkconfig, USB warnings
- Andrew Grover: large ACPI update

- pre1:
- me: drop support for old-style Makefiles entirely. Big.
- me: check b_end_io at the IO submission path
- me: fix "ptep_mkdirty()" (so that swapoff() works correctly)
- fix fault case in copy_from_user() with a constant size, where
((size & 3) == 3)



2000-12-28 21:02:38

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: test13-pre5


On Thu, 28 Dec 2000, Linus Torvalds wrote:

> This still doesn't tell "sync()" about dirty pages (ie the "innd loses the
> active file after a reboot" bug), but now the places that mark pages dirty
> are under control. Next step..

Do you really want to split the per-address-space pages list in dirty and
clean lists for 2.4 ?

Or do you think walking the current per-address-space page list searching
for dirty pages and syncing them is ok?

2000-12-28 21:30:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: test13-pre5


On Thu, 28 Dec 2000, Marcelo Tosatti wrote:
>
> On Thu, 28 Dec 2000, Linus Torvalds wrote:
>
> > This still doesn't tell "sync()" about dirty pages (ie the "innd loses the
> > active file after a reboot" bug), but now the places that mark pages dirty
> > are under control. Next step..
>
> Do you really want to split the per-address-space pages list in dirty and
> clean lists for 2.4 ?
>
> Or do you think walking the current per-address-space page list searching
> for dirty pages and syncing them is ok?

There are a few issues:

- fdatasync/fsync is often quite critical for databases. It's _possibly_
ok to just walk all the pages for an inode, but I'm fairly certain that
this is an area where if we don't have a separate dirty queue we _will_
need to add one later.

- global dirty list for global syn(). We don't have one, and I don't
think we want one. We could add a few lists, and split up the active
list into "active" and "active_dirty", for example, but I don't like
the implications that would probably have for the LRU ordering.

- we absolutely do _not_ want to make "struct page" bigger. We can't
afford to just throw away another 8 bytes per page on adding a new list
structure, I feel. Even if this would be the simplest solution.

So right now I think the right idea is to

- split up "address_space->pages" into "->clean_pages" and
"->dirty_pages". This is fairly easily done, it requires small changes
like making "truncate_inode_pages()" instead be
"truncate_list_pages()", and making "truncate_inode_pages()" call that
for both the dirty and the clean lists. That's about 10 lines of diff
(I already tried this), and that's about the biggest example of
something like that. Most other areas don't much care about the inode
page lists.

- make "SetPageDirty()" do something like

if (!test_and_set(PG_dirty, &page->flags)) {
spin_lock(&page_cache_lock);
list_del(page->list);
list_add(page->list, page->mapping->dirty_pages);
spin_unlock(&page_cache_lock);
}

This will require making sure that every place that does a
SetPageDirty() will be ok with this (ie double-check that they all have
a mapping: right now the free_pte() code in mm/memory.c doesn't care,
because it knew that it coul dmark even anonymous pages dirty and
they'd just get ignored.

- make a filemap_fdatasync() that walks the dirty pages and does a
writepage() on them all and moves them to the clean list.

- make fsync() and fdatasync() call the above function before they even
call the low-level per-FS code.

- make sync_inodes() use that same filemap_fdatasync() function so that
the sync() case is handled.

All done. It looks something like 5-10 places, most of which are about 10
lines of diff each, if even that.

The only real worry would be that the locking isn't rigth, but getting the
pagemap lock should be the safe thing, and from a lock contention
standpoint it should be ok (if we move a lot of pages back and forth, lock
contention is going to be the least of our worries, because it implies
that we'd be doing a LOT of IO to actually write the pages out).

If somebody (you? hint, hint) wants to do this, I'd be very happy - I can
do it myself, but because it's my birthday I'm supposed to drag myself off
the computer soon and be social, or Tove will be grumpy.

And you don't want Tove grumpy.

Linus

2000-12-28 21:45:55

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: test13-pre5


On Thu, 28 Dec 2000, Linus Torvalds wrote:

> If somebody (you? hint, hint) wants to do this,

Ok, I'll do it because I love Tove.

> I'd be very happy - I can do it myself, but because it's my birthday
> I'm supposed to drag myself off the computer soon and be social, or
> Tove will be grumpy.

2000-12-28 21:52:15

by Rik van Riel

[permalink] [raw]
Subject: [wildly off-topic] Re: test13-pre5

On Thu, 28 Dec 2000, Marcelo Tosatti wrote:
> On Thu, 28 Dec 2000, Linus Torvalds wrote:
>
> > If somebody (you? hint, hint) wants to do this,
>
> Ok, I'll do it because I love Tove.

Marcelo, you should buy some glasses ;)

Tove != Tux

It's ok and probably safe to love Tux, the nice cuddly
penguin everybody loves.

However, loving the (6-time ??) Finnish female karate
champion, who happens to be married to Linus is probably
quite a bit less safe ...

cheers,

Rik
--
Hollywood goes for world dumbination,
Trailer at 11.

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2000-12-28 21:54:55

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [wildly off-topic] Re: test13-pre5


On Thu, 28 Dec 2000, Rik van Riel wrote:

> On Thu, 28 Dec 2000, Marcelo Tosatti wrote:
> > On Thu, 28 Dec 2000, Linus Torvalds wrote:
> >
> > > If somebody (you? hint, hint) wants to do this,
> >
> > Ok, I'll do it because I love Tove.
>
> Marcelo, you should buy some glasses ;)
>
> Tove != Tux
>
> It's ok and probably safe to love Tux, the nice cuddly
> penguin everybody loves.
>
> However, loving the (6-time ??) Finnish female karate
> champion, who happens to be married to Linus is probably
> quite a bit less safe ...

Marcelo runs like hell.

2000-12-28 21:58:36

by Jonathan Hudson

[permalink] [raw]
Subject: Re: test13-pre5 (via82cxxx_audio.c)


In article <[email protected]>,
Linus Torvalds <[email protected]> writes:
LT>
LT> The mm cleanups also include removing "swapout()" as a VM operation, as

swapout was not removed from drivers/sound/via82cxxx_audio.c; the
following does so (compiles and produces sound, someone who
understands this please check).


--- drivers/sound/via82cxxx_audio.c.orig Thu Dec 28 21:02:03 2000
+++ drivers/sound/via82cxxx_audio.c Thu Dec 28 21:12:58 2000
@@ -1727,20 +1727,8 @@
}


-#ifndef VM_RESERVE
-static int via_mm_swapout (struct page *page, struct file *filp)
-{
- return 0;
-}
-#endif /* VM_RESERVE */
-
-
struct vm_operations_struct via_mm_ops = {
nopage: via_mm_nopage,
-
-#ifndef VM_RESERVE
- swapout: via_mm_swapout,
-#endif
};




2000-12-28 22:12:37

by Daniel Phillips

[permalink] [raw]
Subject: Re: test13-pre5

Linus Torvalds wrote:
> - global dirty list for global syn(). We don't have one, and I don't
> think we want one. We could add a few lists, and split up the active
> list into "active" and "active_dirty", for example, but I don't like
> the implications that would probably have for the LRU ordering.

This has been the subject of a lot of flam^H^H^H^H discussion on
#kernelnewbies about this and it's still an open question. The only way
to see if a separate active_dirty hurts or helps is to try it. Later.
:-)

I don't see how a separate active_dirty list can hurt LRU ordering. We
could still take the pages off the two lists in the same order we did
with one list if we wanted to, or at least, statistically the same in
turns of number, age, time since entering the list, etc. That better
not cause radically different or undesireable behaviour or something is
really broken.

By breaking active into two lists we'd get a very interesting tuning
parameter to play with: the relative rate at which pages are moved from
active to inactive. Beyond that, the active_dirty list could be pressed
into service quite easily as a page-oriented version of kflushd, and
would obviously be useful as a global sync list.

--
Daniel

2000-12-28 22:48:11

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, Dec 28, 2000 at 12:59:22PM -0800, Linus Torvalds wrote:
> - we absolutely do _not_ want to make "struct page" bigger. We can't
> afford to just throw away another 8 bytes per page on adding a new list
> structure, I feel. Even if this would be the simplest solution.

BTW..

The current 2.4 struct page could be already shortened a lot, saving a lot
of cache.
(first number for 32bit, second for 64bit)

- Do not compile virtual in when the kernel does not support highmem
(saves 4/8 bytes)
- Instead of having a zone pointer mask use a 8 or 16 byte index into a
zone table. On a modern CPU it is much cheaper to do the and/shifts than
to do even a single cache miss during page aging. On a lot of systems
that zone index could be hardcoded to 0 anyways, giving better code.
- Instead of using 4/8 bytes for the age use only 16bit (FreeBSD which
has the same swapping algorithm even only uses 8bit)
- Remove the waitqueue debugging (obvious @)
- flags can be __u32 on 64bit hosts, sharing 64bit with something that
is tolerant to async updates (e.g. the zone table index or the index)
- index could be probably u32 instead of unsigned long, saving 4 bytes
on i386

Would you consider patches for any of these points?


-Andi


2000-12-28 23:20:48

by David Miller

[permalink] [raw]
Subject: Re: test13-pre5

Date: Thu, 28 Dec 2000 23:17:22 +0100
From: Andi Kleen <[email protected]>

Would you consider patches for any of these points?

To me it seems just as important to make sure struct page is
a power of 2 in size, with the waitq debugging turned off this
is true for both 32-bit and 64-bit hosts last time I checked.

Later,
David S. Miller
[email protected]

2000-12-28 23:29:29

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, Dec 28, 2000 at 02:33:07PM -0800, David S. Miller wrote:
> Date: Thu, 28 Dec 2000 23:17:22 +0100
> From: Andi Kleen <[email protected]>
>
> Would you consider patches for any of these points?
>
> To me it seems just as important to make sure struct page is
> a power of 2 in size, with the waitq debugging turned off this
> is true for both 32-bit and 64-bit hosts last time I checked.

Why exactly a power of two ? To get rid of ->index ?


-Andi

2000-12-28 23:35:08

by Rik van Riel

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, 28 Dec 2000, Andi Kleen wrote:
> On Thu, Dec 28, 2000 at 02:33:07PM -0800, David S. Miller wrote:
> > Date: Thu, 28 Dec 2000 23:17:22 +0100
> > From: Andi Kleen <[email protected]>
> >
> > Would you consider patches for any of these points?
> >
> > To me it seems just as important to make sure struct page is
> > a power of 2 in size, with the waitq debugging turned off this
> > is true for both 32-bit and 64-bit hosts last time I checked.
>
> Why exactly a power of two ? To get rid of ->index ?

Most likely to minimise the number of cache misses needed
to access a complete page_struct.

Then again, I guess 48 bytes would _also_ guarantee that
we never need more than 2 cache misses to access every
part of the page_struct.

And the memory wasted in the page_struct may well be a
bigger factor than the cache misses on lots of systems...

(time for another CONFIG option? ;))

regards,

Rik
--
Hollywood goes for world dumbination,
Trailer at 11.

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2000-12-28 23:42:18

by David Miller

[permalink] [raw]
Subject: Re: test13-pre5

Date: Thu, 28 Dec 2000 23:58:36 +0100
From: Andi Kleen <[email protected]>

Why exactly a power of two ? To get rid of ->index ?

To make things like "page - mem_map" et al. use shifts instead of
expensive multiplies...

Later,
David S. Miller
[email protected]

2000-12-28 23:46:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: test13-pre5



On Thu, 28 Dec 2000, Andi Kleen wrote:
>
> BTW..
>
> The current 2.4 struct page could be already shortened a lot, saving a lot
> of cache.

Not that much, but some.

> (first number for 32bit, second for 64bit)
>
> - Do not compile virtual in when the kernel does not support highmem
> (saves 4/8 bytes)

Even on UP, "virtual" often helps. The conversion from "struct page" to
the linear address is quite common, and if "struct page" isn't a
power-of-two it gets slow.

> - Instead of having a zone pointer mask use a 8 or 16 byte index into a
> zone table. On a modern CPU it is much cheaper to do the and/shifts than
> to do even a single cache miss during page aging. On a lot of systems
> that zone index could be hardcoded to 0 anyways, giving better code.
> - Instead of using 4/8 bytes for the age use only 16bit (FreeBSD which
> has the same swapping algorithm even only uses 8bit)

This would be good, but can be hard.

FreeBSD doesn't try to be portable any more, but Linux does, and there are
architectures where 8- and 16-bit accesses aren't atomic but have to be
done with read-modify-write cycles.

And even for fields like "age", where we don't care whether the age itself
is 100% accurate, we _do_ care that the fields close-by don't get strange
effects from updating "age". We used to have exactly this problem on alpha
back in the 2.1.x timeframe.

This is why a lot of fields are 32-bit, even though we wouldn't need more
than 8 or 16 bits of them.

> - Remove the waitqueue debugging (obvious @)

Not obvious enough. There are magic things that could be done, like hiding
the wait-queue lock bit in the waitqueue lists themselves etc. That could
be done with some per-architecture magic etc.

> - flags can be __u32 on 64bit hosts, sharing 64bit with something that
> is tolerant to async updates (e.g. the zone table index or the index)
> - index could be probably u32 instead of unsigned long, saving 4 bytes
> on i386

It already _is_ 32-bit on x86.

Only the LSF patches made it 64-bit. That never made it into the standard
kernel.

Sure, we could make it "u32" and thus force it to be 32-bit even on 64-bit
architectures, but some day somebody will want to have more than 46 bits
of file mappings, and which 46 bits is _huge_ on a 32-bit machine, on a
64-bit one in 5 years it will not be entirely unreasonable.

Anyway, I don't want to increase "struct page" in size, but I also don't
think it's worth micro-optimizing some of these things if the code gets
harder to maintain (like the partial-word stuff would be).

The biggest win by far would come from increasing the page-size, something
we can do even in software. Having a "kernel page size" of 8kB even on x86
would basically cut the overhead in half. As that would also improve some
other things (like having better throughput due to bigger contiguous
chunks), that's something I'd like to see some day.

(And user space wouldn't ever have to know - we could map in "half pages"
aka "hardware pages" without mappign the whole page).

Linus

2000-12-28 23:48:18

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
> Date: Thu, 28 Dec 2000 23:58:36 +0100
> From: Andi Kleen <[email protected]>
>
> Why exactly a power of two ? To get rid of ->index ?
>
> To make things like "page - mem_map" et al. use shifts instead of
> expensive multiplies...

I thought that is what ->index is for ?

Also gcc seems to be already quite clever at dividing through small
integers, e.g. using mul and shift and sub, so it may not be even worth to reach
for a real power-of-two.

I suspect doing the arithmetics is at least faster than eating the
cache misses because of ->index.

-Andikkk


2000-12-28 23:56:10

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, Dec 28, 2000 at 03:15:01PM -0800, Linus Torvalds wrote:
> > (first number for 32bit, second for 64bit)
> >
> > - Do not compile virtual in when the kernel does not support highmem
> > (saves 4/8 bytes)
>
> Even on UP, "virtual" often helps. The conversion from "struct page" to
> the linear address is quite common, and if "struct page" isn't a
> power-of-two it gets slow.

Are you sure? Last time I checked gcc did a very good job at optimizing
small divisions with small integers, without using div. It just has to
be a good integer with not too many set bits.

> is 100% accurate, we _do_ care that the fields close-by don't get strange
> effects from updating "age". We used to have exactly this problem on alpha
> back in the 2.1.x timeframe.

When it is shared with a constant field (like zone index) it shouldn't
matter, no ? At worst you can see outdated data, and when the outdated
data is constant all is fine.

> > - flags can be __u32 on 64bit hosts, sharing 64bit with something that
> > is tolerant to async updates (e.g. the zone table index or the index)
> > - index could be probably u32 instead of unsigned long, saving 4 bytes
> > on i386
>
> It already _is_ 32-bit on x86.

Oops. It was a typo. I meant to write "saving 4 bytes on 64bit"

> Anyway, I don't want to increase "struct page" in size, but I also don't
> think it's worth micro-optimizing some of these things if the code gets
> harder to maintain (like the partial-word stuff would be).

Ok pity :-/

Hopefully all the "goto out" micro optimizations can be taken out then too,
I recently found out that gcc 2.97's block moving pass has the tendency
to move the outlined blocks inline again ;)



-Andi

2000-12-28 23:56:50

by Rik van Riel

[permalink] [raw]
Subject: Re: test13-pre5

On Fri, 29 Dec 2000, Andi Kleen wrote:
> On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
> > Date: Thu, 28 Dec 2000 23:58:36 +0100
> > From: Andi Kleen <[email protected]>
> >
> > Why exactly a power of two ? To get rid of ->index ?
> >
> > To make things like "page - mem_map" et al. use shifts instead of
> > expensive multiplies...
>
> I thought that is what ->index is for ?

Nope, ->index is there to identify which offset the page
has in ->mapping, read mm/filemap.c::__find_page_nolock()
for more info.

> Also gcc seems to be already quite clever at dividing through
> small integers, e.g. using mul and shift and sub, so it may not
> be even worth to reach for a real power-of-two.
>
> I suspect doing the arithmetics is at least faster than eating the
> cache misses because of ->index.

I'm pretty confident that arithmetic is faster than cache
misses ... but an unlucky size of the page struct will cause
extra cache misses due to misalignment.

regards,

Rik
--
Hollywood goes for world dumbination,
Trailer at 11.

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2000-12-29 00:02:31

by David Miller

[permalink] [raw]
Subject: Re: test13-pre5

Date: Fri, 29 Dec 2000 00:17:21 +0100
From: Andi Kleen <[email protected]>

On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
> To make things like "page - mem_map" et al. use shifts instead of
> expensive multiplies...

I thought that is what ->index is for ?

It is for the page cache identity Andi... you know, page_hash(mapping, index)...

And the add/sub/shift expansion of a multiply/divide by constant even
in its' most optimal form is often not trivial, it is something on the
order of 7 instructions with waitq debugging enabled last time I
checked.

Later,
David S. Miller
[email protected]

2000-12-29 00:06:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: test13-pre5



On Fri, 29 Dec 2000, Andi Kleen wrote:

> On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
> > Date: Thu, 28 Dec 2000 23:58:36 +0100
> > From: Andi Kleen <[email protected]>
> >
> > Why exactly a power of two ? To get rid of ->index ?
> >
> > To make things like "page - mem_map" et al. use shifts instead of
> > expensive multiplies...
>
> I thought that is what ->index is for ?

No. "index" only gives the virtual index.

"page - mem_map" is how you get the _physical_ index in the zone in
question, which is common for physical tranlations (ie "pte_page()",
"page_to_virt()" or "page_to_phys()")

> Also gcc seems to be already quite clever at dividing through small
> integers, e.g. using mul and shift and sub, so it may not be even worth to reach
> for a real power-of-two.

Look at the code - it's a big multiply to do a divide by 68 or similar.
Quite expensive.

Doing "page->address - TASK_SIZE" on x86 for the non-highmem case would
probably be faster.

Linus

2000-12-29 00:08:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: test13-pre5



On Fri, 29 Dec 2000, Andi Kleen wrote:
>
> Hopefully all the "goto out" micro optimizations can be taken out then too,

"goto out" often generates much more readable code, so the optimization is
secondary.

> I recently found out that gcc 2.97's block moving pass has the tendency
> to move the outlined blocks inline again ;)

Too bad. Maybe somebody should tell gcc maintainers about programmers that
know more than the compiler again.

Linus

2000-12-29 00:10:44

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, Dec 28, 2000 at 03:14:56PM -0800, David S. Miller wrote:
> Date: Fri, 29 Dec 2000 00:17:21 +0100
> From: Andi Kleen <[email protected]>
>
> On Thu, Dec 28, 2000 at 02:54:52PM -0800, David S. Miller wrote:
> > To make things like "page - mem_map" et al. use shifts instead of
> > expensive multiplies...
>
> I thought that is what ->index is for ?
>
> It is for the page cache identity Andi... you know, page_hash(mapping, index)...

Oops, I confused it with the 2.0 page->map_nr, which did exactly that.

I should have known better. Thanks for correcting this brainfart.

> And the add/sub/shift expansion of a multiply/divide by constant even
> in its' most optimal form is often not trivial, it is something on the
> order of 7 instructions with waitq debugging enabled last time I
> checked.

Wonder if it looks better with wq debugging turned off or a compressed
->zone.


-Andi

2000-12-29 00:14:34

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, Dec 28, 2000 at 03:37:51PM -0800, Linus Torvalds wrote:
>
>
> On Fri, 29 Dec 2000, Andi Kleen wrote:
> >
> > Hopefully all the "goto out" micro optimizations can be taken out then too,
>
> "goto out" often generates much more readable code, so the optimization is
> secondary.

I was more thinking of cases like the scheduler's gotos, which has gotten
rather spagetti recently. Admittedly classic goto out is often more readable
than many nested if()s with error handling.

>
> > I recently found out that gcc 2.97's block moving pass has the tendency
> > to move the outlined blocks inline again ;)
>
> Too bad. Maybe somebody should tell gcc maintainers about programmers that
> know more than the compiler again.

In x86-64 which relies on 2.97 I'm using __builtin_expect, defined to
likely() and unlikely(), which seems to generate good code.


-Andi

2000-12-29 00:37:53

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: test13-pre5



On Thu, 28 Dec 2000, Linus Torvalds wrote:

> - make "SetPageDirty()" do something like
>
> if (!test_and_set(PG_dirty, &page->flags)) {
> spin_lock(&page_cache_lock);
> list_del(page->list);
> list_add(page->list, page->mapping->dirty_pages);
> spin_unlock(&page_cache_lock);
> }
>
> This will require making sure that every place that does a
> SetPageDirty() will be ok with this (ie double-check that they all have
> a mapping: right now the free_pte() code in mm/memory.c doesn't care,
> because it knew that it coul dmark even anonymous pages dirty and
> they'd just get ignored.
> - make a filemap_fdatasync() that walks the dirty pages and does a
> writepage() on them all and moves them to the clean list.

We also want to move the page to the per-address-space clean list in
ClearPageDirty I suppose.

2000-12-29 01:21:09

by Stefan Traby

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, Dec 28, 2000 at 03:37:51PM -0800, Linus Torvalds wrote:

> Too bad. Maybe somebody should tell gcc maintainers about programmers that
> know more than the compiler again.

I know that {p,}gcc-2.95.2{,.1} are not officially supported.

Did you know that it's impossible to compile nfsv4 because of
register allocation problems with long long since (long long) month ?

The following does not hurt, it's just a fix for a broken
compiler:

--- linux/fs/lockd/xdr4.c.orig Fri Dec 29 01:35:32 2000
+++ linux/fs/lockd/xdr4.c Fri Dec 29 01:36:36 2000
@@ -156,7 +156,7 @@
nlm4_encode_lock(u32 *p, struct nlm_lock *lock)
{
struct file_lock *fl = &lock->fl;
- __s64 start, len;
+ volatile __s64 start, len;

if (!(p = xdr_encode_string(p, lock->caller))
|| !(p = nlm4_encode_fh(p, &lock->fh))


Here is an example without this patch (pgcc-2.95.2.1 this time which
is bug-compatible to gcc-2.95.2.1).

gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe -mpreferred-stack-boundary=2 -march=i686 -DMODULE -c -o xdr4.o xdr4.c
xdr4.c: In function `nlm4_encode_lock':
xdr4.c:181: internal error--insn does not satisfy its constraints:
(insn/i 313 585 315 (set (reg:SI 1 %edx)
(subreg:SI (lshiftrt:DI (reg:DI 0 %eax)
(const_int 32 [0x20])) 0)) 323 {lshrdi3_const_int_subreg} (nil)
(nil))
gcc: Internal compiler error: program cpp got fatal signal 13
make[2]: *** [xdr4.o] Error 1
make[2]: Leaving directory `/.localvol000/usr/src/linux-2.4.0-test13pre5/fs/lockd'
make[1]: *** [_modsubdir_lockd] Error 2
make[1]: Leaving directory `/.localvol000/usr/src/linux-2.4.0-test13pre5/fs'
make: *** [_mod_fs] Error 2

The question is: Is it worth to apply ?

--

ciao -
Stefan

" export PS1="((((((((((((rms))))))))))))# " "

Stefan Traby Linux/ia32 fax: +43-3133-6107-9
Mitterlasznitzstr. 13 Linux/alpha phone: +43-3133-6107-2
8302 Nestelbach Linux/sparc http://www.hello-penguin.com
Austria mailto://[email protected]
Europe mailto://[email protected]

2000-12-29 01:42:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: test13-pre5



On Thu, 28 Dec 2000, Marcelo Tosatti wrote:
>
> We also want to move the page to the per-address-space clean list in
> ClearPageDirty I suppose.

I would actually advice against this.

- it's ok to have too many pages on the dirty list (think o fthe dirty
list as a "these pages _can_ be dirty")

- whenever we do a ClearPageDirty() we're likely to remove the page from
the lists altogether, so it's not worth it doing extra work.

The exception, of course, is the actual "filemap_fdatasync()" function,
but that one would probably look something like

spin_lock(&page_cache_lock);
while (!list_empty(&mapping->dirty_pages)) {
struct page *page = list_entry(mapping->dirty_pages.next, struct page, list);

list_del(&page->list);
list_add(&page->list, &mapping->clean_pages);

if (!PageDirty(page))
continue;
page_get(page);
spin_unlock(&page_cache_lock);

lock_page(page);
if (PageDirty(page)) {
ClearPageDirty(page);
page->mapping->writepage(page);
}
UnlockPage(page);
page_cache_put(page);
spin_lock(&page_cache_lock);
}
spin_unlock(&page_cache_lock);

and again note how we can move it to the clean list early and we don't
have to keep the PageDirty bit 100% in sync with which list is it on. If
somebody marks it dirty later on (and the dirty bit is still set), that
somebody won't move it back to the dirty list (because it noticved that
the dirty bit is already set), but that's ok: as long as we do the
"ClearPageDirty(page);" call before startign the actual writeout(), we're
fine.

So the "mapping->dirty_pages" list is maybe not so much a _dirty_ list, as
a "scheduled for writeout" list. Marking the page clean doesn't remove it
from that list - it can happily stay on the list and then when the
writeout is started we'd just skip it.

Ok?

Linus

2000-12-29 01:54:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: test13-pre5



On Fri, 29 Dec 2000, Stefan Traby wrote:
> On Thu, Dec 28, 2000 at 03:37:51PM -0800, Linus Torvalds wrote:
>
> > Too bad. Maybe somebody should tell gcc maintainers about programmers that
> > know more than the compiler again.
>
> I know that {p,}gcc-2.95.2{,.1} are not officially supported.

Hmm, I use gcc-2.95.2 myself on some machines, and while I'm not 100%
comfortable with it, it does count as "supported" even if it has known
problems with "long long". pgcc isn't.

> Did you know that it's impossible to compile nfsv4 because of
> register allocation problems with long long since (long long) month ?

lockd v4 (for NFS v3), I assume.

No, I wasn't aware of this particular bug.

> The following does not hurt, it's just a fix for a broken
> compiler:

Ugh, that's ugly.

Can you test if it is sufficient to just simplify the math a bit, instead
of uglyfing that function more? The nlm4_encode_lock() function already
tests for NLM4_OFFSET_MAX explicitly for both start and end, so it should
be ok to just re-code the function to not do the extra "loff_t_to_s64()"
stuff, and simplify it enough that the compile rwill be happy to compile
the simpler function. Something along the lines of

if (.. NLM4_OFFSET_MAX tests ..)
..

*p++ = htonl(fl->fl_pid);

start = fl->fl_start;
len = fl->fl_end - start;
if (fl->fl_end == OFFSET_MAX)
len = 0;

p = xdr_encode_hyper(p, start);
p = xdr_encode_hyper(p, len);

return p;

Where it tries to minimize the liveness of the 64-bit values, and tries to
avoid extra complications.

Linus

2000-12-29 06:36:09

by Albert Cranford

[permalink] [raw]
Subject: Re: test13-pre5

Simply executing
*p++ = htonl(fl->fl_pid);
before
start = loff_t_to_s64(fl->fl_start);
also works.
Later,
Albert

Linus Torvalds wrote:
>
> On Fri, 29 Dec 2000, Stefan Traby wrote:
> > On Thu, Dec 28, 2000 at 03:37:51PM -0800, Linus Torvalds wrote:
> >
> > > Too bad. Maybe somebody should tell gcc maintainers about programmers that
> > > know more than the compiler again.
> >
> > I know that {p,}gcc-2.95.2{,.1} are not officially supported.
>
> Hmm, I use gcc-2.95.2 myself on some machines, and while I'm not 100%
> comfortable with it, it does count as "supported" even if it has known
> problems with "long long". pgcc isn't.
>
> > Did you know that it's impossible to compile nfsv4 because of
> > register allocation problems with long long since (long long) month ?
>
> lockd v4 (for NFS v3), I assume.
>
> No, I wasn't aware of this particular bug.
>
> > The following does not hurt, it's just a fix for a broken
> > compiler:
>
> Ugh, that's ugly.
>
> Can you test if it is sufficient to just simplify the math a bit, instead
> of uglyfing that function more? The nlm4_encode_lock() function already
> tests for NLM4_OFFSET_MAX explicitly for both start and end, so it should
> be ok to just re-code the function to not do the extra "loff_t_to_s64()"
> stuff, and simplify it enough that the compile rwill be happy to compile
> the simpler function. Something along the lines of
>
> if (.. NLM4_OFFSET_MAX tests ..)
> ..
>
> *p++ = htonl(fl->fl_pid);
>
> start = fl->fl_start;
> len = fl->fl_end - start;
> if (fl->fl_end == OFFSET_MAX)
> len = 0;
>
> p = xdr_encode_hyper(p, start);
> p = xdr_encode_hyper(p, len);
>
> return p;
>
> Where it tries to minimize the liveness of the 64-bit values, and tries to
> avoid extra complications.
>
> Linus
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/

--
Albert Cranford Deerfield Beach FL USA
[email protected]

2000-12-29 12:39:40

by Daniel Phillips

[permalink] [raw]
Subject: Re: test13-pre5

Marcelo Tosatti wrote:
>
> On Thu, 28 Dec 2000, Linus Torvalds wrote:
>
> > - make "SetPageDirty()" do something like
> >
> > if (!test_and_set(PG_dirty, &page->flags)) {
> > spin_lock(&page_cache_lock);
> > list_del(page->list);
> > list_add(page->list, page->mapping->dirty_pages);
> > spin_unlock(&page_cache_lock);
> > }
>
> We also want to move the page to the per-address-space clean list in
> ClearPageDirty I suppose.

I'd like to suggest taking this opportunity to regularize the notation
by going to set_page_dirty/clear_page_dirty which will call
SetPageDirty/ClearPageDirty.

--
Daniel

2000-12-29 15:25:19

by Stefan Traby

[permalink] [raw]
Subject: Re: test13-pre5

On Fri, Dec 29, 2000 at 01:06:30AM +0000, Albert Cranford wrote:
> Simply executing
> *p++ = htonl(fl->fl_pid);
> before
> start = loff_t_to_s64(fl->fl_start);
> also works.

Yes, confirmed.
Since you're located in Florida I vote for this and I hope that
Linus will elect it. :)


--- linux/fs/lockd/xdr4.c.orig Fri Dec 29 01:35:32 2000
+++ linux/fs/lockd/xdr4.c Fri Dec 29 14:56:07 2000
@@ -167,13 +167,13 @@
|| (fl->fl_end > NLM4_OFFSET_MAX && fl->fl_end != OFFSET_MAX))
return NULL;

+ *p++ = htonl(fl->fl_pid);
start = loff_t_to_s64(fl->fl_start);
if (fl->fl_end == OFFSET_MAX)
len = 0;
else
len = loff_t_to_s64(fl->fl_end - fl->fl_start + 1);

- *p++ = htonl(fl->fl_pid);
p = xdr_encode_hyper(p, start);
p = xdr_encode_hyper(p, len);

--

ciao -
Stefan

" export PS1="((((((((((((rms))))))))))))# " "

Stefan Traby Linux/ia32 fax: +43-3133-6107-9
Mitterlasznitzstr. 13 Linux/alpha phone: +43-3133-6107-2
8302 Nestelbach Linux/sparc http://www.hello-penguin.com
Austria mailto://[email protected]
Europe mailto://[email protected]

2000-12-29 16:14:29

by Mark Hemment

[permalink] [raw]
Subject: Re: test13-pre5

Hi,

On Thu, 28 Dec 2000, David S. Miller wrote:
> Date: Thu, 28 Dec 2000 23:17:22 +0100
> From: Andi Kleen <[email protected]>
>
> Would you consider patches for any of these points?
>
> To me it seems just as important to make sure struct page is
> a power of 2 in size, with the waitq debugging turned off this
> is true for both 32-bit and 64-bit hosts last time I checked.

Checking test11 (which I'm running here), even with waitq debugging
turned off, on 32-bit (IA32) the struct page is 68bytes (since
the "age" member was re-introduced a while back).

For my development testing, I'm running a _heavily_ hacked kernel. One
of these hacks is to pull the wait_queue_head out of struct page; the
waitq-heads are in a separate allocated area of memory, with a waitq-head
pointer embedded in the page structure (allocated/initialised in
free_area_init_core()). This gives a page structure of 60bytes, giving me
one free double-word to play with (which I'm using as a pointer to a
release function).

Infact, there doesn't need to be a waitq-head allocated for each page
structure - they can share; with a performance overhead on a false
wakeup in __wait_on_page().
Note, for those of us running on 32bit with lots of physical memory, the
available virtual address-space is of major consideration. Reducing the
size of the page structure is more than just reducing cache misses - it
gives us more virtual to play with...

Mark

2000-12-29 17:01:21

by Tim Wright

[permalink] [raw]
Subject: Re: test13-pre5

On Fri, Dec 29, 2000 at 03:46:22PM +0000, Mark Hemment wrote:
> Note, for those of us running on 32bit with lots of physical memory, the
> available virtual address-space is of major consideration. Reducing the
> size of the page structure is more than just reducing cache misses - it
> gives us more virtual to play with...
>
> Mark
>

Yes, this is a very important point if we ever want to make serious use
of large memory machines on ia32. We ran into this with DYNIX/ptx when the
P6 added 36-bit physical addressing. Conserving KVA (kernel virtual address
space), became a very high priority. Eventually, we had to add code to play
silly segment games and "magically" materialize and dematerialize a 4GB
kernel virtual address space instead of the 1GB. This only comes into play
with really large amounts of memory, and is almost certainly not worth the
agony of implementation on Linux, but we'll need to be careful elsewhere to
conserve it as much as possible.

Regards,

Tim


--
Tim Wright - [email protected] or [email protected] or [email protected]
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

2000-12-29 18:22:39

by Mark Hemment

[permalink] [raw]
Subject: Re: test13-pre5

On Fri, 29 Dec 2000, Tim Wright wrote:
> Yes, this is a very important point if we ever want to make serious use
> of large memory machines on ia32. We ran into this with DYNIX/ptx when the
> P6 added 36-bit physical addressing. Conserving KVA (kernel virtual address
> space), became a very high priority. Eventually, we had to add code to play
> silly segment games and "magically" materialize and dematerialize a 4GB
> kernel virtual address space instead of the 1GB. This only comes into play
> with really large amounts of memory, and is almost certainly not worth the
> agony of implementation on Linux, but we'll need to be careful elsewhere to
> conserve it as much as possible.

Indeed. I'm compiling my kernels with 2GB virtual. Not as I want more
NORMAL pages in the page cache (HIGH memory is fine), but as I need
NORMAL pages for kernel data/structures (memory allocated from
slab-caches) which need to be constantly mapped in.

Mark


2000-12-29 18:48:13

by Tom Rini

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, Dec 28, 2000 at 12:25:23PM -0800, Linus Torvalds wrote:

> - pre5:
> - NIIBE Yutaka: SuperH update
> - Geert Uytterhoeven: m68k update
> - David Miller: TCP RTO calc fix, UDP multicast fix etc
> - Duncan Laurie: ServerWorks PIRQ routing definition.
> - mm PageDirty cleanups, added sanity checks, and don't lose the bit.

I just noticed this (playing with some other stuff), but ext2 as a module
is currently broken:
$ make INSTALL_MOD_PATH=/tmp/foo modules_install
...
if [ -r System.map ]; then /sbin/depmod -ae -F System.map -b /tmp/foo -r 2.4.0-test12; fi
depmod: *** Unresolved symbols in /tmp/foo/lib/modules/2.4.0-test12/kernel/fs/ext2/ext2.o
depmod: buffer_insert_inode_queue
depmod: fsync_inode_buffers

I tried the following locally and it fixes it.
---<cut>---
--- fs/Makefile.orig Fri Dec 29 10:35:50 2000
+++ fs/Makefile Fri Dec 29 10:36:06 2000
@@ -7,7 +7,7 @@

O_TARGET := fs.o

-export-objs := filesystems.o
+export-objs := filesystems.o buffer.o
mod-subdirs := nls

obj-y := open.o read_write.o devices.o file_table.o buffer.o \
--- fs/buffer.c.orig Fri Dec 29 10:33:21 2000
+++ fs/buffer.c Fri Dec 29 10:35:46 2000
@@ -29,6 +29,7 @@
/* async buffer flushing, 1999 Andrea Arcangeli <[email protected]> */

#include <linux/config.h>
+#include <linux/module.h>
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/malloc.h>
@@ -579,6 +580,8 @@
spin_unlock(&lru_list_lock);
}

+EXPORT_SYMBOL(buffer_insert_inode_queue);
+
/* The caller must have the lru_list lock before calling the
remove_inode_queue functions. */
static void __remove_inode_queue(struct buffer_head *bh)
@@ -900,6 +903,7 @@
return err2;
}

+EXPORT_SYMBOL(fsync_inode_buffers);

/*
* osync is designed to support O_SYNC io. It waits synchronously for
---<end>---

--
Tom Rini (TR1265)
http://gate.crashing.org/~trini/

2000-12-29 19:12:47

by Marcelo Tosatti

[permalink] [raw]
Subject: [PATCH] filemap_fdatasync & related changes


On Thu, 28 Dec 2000, Linus Torvalds wrote:

> All done. It looks something like 5-10 places, most of which are about 10
> lines of diff each, if even that.
>
> The only real worry would be that the locking isn't rigth, but getting the
> pagemap lock should be the safe thing, and from a lock contention
> standpoint it should be ok (if we move a lot of pages back and forth, lock
> contention is going to be the least of our worries, because it implies
> that we'd be doing a LOT of IO to actually write the pages out).
>
> If somebody (you? hint, hint) wants to do this, I'd be very happy - I can
> do it myself, but because it's my birthday I'm supposed to drag myself off
> the computer soon and be social, or Tove will be grumpy.

Ok, here it is.

I hope I got all locking and all special cases right.

Comments ?


diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/fs/buffer.c linux/fs/buffer.c
--- linux.orig/fs/buffer.c Fri Dec 29 15:30:52 2000
+++ linux/fs/buffer.c Fri Dec 29 16:21:17 2000
@@ -320,6 +320,46 @@
return 0;
}

+/**
+ * filemap_fdatasync - walk the list of dirty pages of the given address space
+ * and writepage() all of them.
+ *
+ * @mapping: address space structure to write
+ *
+ */
+void filemap_fdatasync(struct address_space * mapping)
+{
+ int (*writepage)(struct page *) = mapping->a_ops->writepage;
+
+ spin_lock(&pagecache_lock);
+
+ while (!list_empty(&mapping->dirty_pages)) {
+ struct page *page = list_entry(mapping->dirty_pages.next,
+ struct page, list);
+
+ list_del(&page->list);
+ list_add(&page->list, &mapping->clean_pages);
+
+ if (!PageDirty(page))
+ continue;
+
+ page_cache_get(page);
+ spin_unlock(&pagecache_lock);
+
+ lock_page(page);
+
+ if (PageDirty(page)) {
+ ClearPageDirty(page);
+ writepage(page);
+ }
+
+ UnlockPage(page);
+ page_cache_release(page);
+ spin_lock(&pagecache_lock);
+ }
+ spin_unlock(&pagecache_lock);
+}
+
/*
* filp may be NULL if called via the msync of a vma.
*/
@@ -367,6 +407,8 @@
if (!file->f_op || !file->f_op->fsync)
goto out_putf;

+ filemap_fdatasync(inode->i_mapping);
+
/* We need to protect against concurrent writers.. */
down(&inode->i_sem);
err = file->f_op->fsync(file, dentry, 0);
@@ -396,6 +438,8 @@
err = -EINVAL;
if (!file->f_op || !file->f_op->fsync)
goto out_putf;
+
+ filemap_fdatasync(inode->i_mapping);

down(&inode->i_sem);
err = file->f_op->fsync(file, dentry, 1);
diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/fs/inode.c linux/fs/inode.c
--- linux.orig/fs/inode.c Fri Dec 1 05:50:27 2000
+++ linux/fs/inode.c Fri Dec 29 16:14:09 2000
@@ -100,7 +100,8 @@
memset(inode, 0, sizeof(*inode));
init_waitqueue_head(&inode->i_wait);
INIT_LIST_HEAD(&inode->i_hash);
- INIT_LIST_HEAD(&inode->i_data.pages);
+ INIT_LIST_HEAD(&inode->i_data.clean_pages);
+ INIT_LIST_HEAD(&inode->i_data.dirty_pages);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_dirty_buffers);
sema_init(&inode->i_sem, 1);
@@ -206,6 +207,8 @@
inode->i_state |= I_LOCK;
inode->i_state &= ~I_DIRTY;
spin_unlock(&inode_lock);
+
+ filemap_fdatasync(inode->i_mapping);

write_inode(inode, sync);

diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/mm/filemap.c linux/mm/filemap.c
--- linux.orig/mm/filemap.c Fri Dec 29 15:30:55 2000
+++ linux/mm/filemap.c Fri Dec 29 16:18:01 2000
@@ -71,7 +71,7 @@

static inline void add_page_to_inode_queue(struct address_space *mapping, struct page * page)
{
- struct list_head *head = &mapping->pages;
+ struct list_head *head = &mapping->clean_pages;

mapping->nrpages++;
list_add(&page->list, head);
@@ -144,7 +144,7 @@
struct list_head *head, *curr;
struct page * page;

- head = &inode->i_mapping->pages;
+ head = &inode->i_mapping->clean_pages;

spin_lock(&pagecache_lock);
spin_lock(&pagemap_lru_lock);
@@ -204,26 +204,12 @@
page_cache_release(page);
}

-/**
- * truncate_inode_pages - truncate *all* the pages from an offset
- * @mapping: mapping to truncate
- * @lstart: offset from with to truncate
- *
- * Truncate the page cache at a set offset, removing the pages
- * that are beyond that offset (and zeroing out partial pages).
- * If any page is locked we wait for it to become unlocked.
- */
-void truncate_inode_pages(struct address_space * mapping, loff_t lstart)
+void truncate_list_pages(struct list_head *head, unsigned long start, unsigned partial)
{
- struct list_head *head, *curr;
+ struct list_head *curr;
struct page * page;
- unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
- unsigned long start;
-
- start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

repeat:
- head = &mapping->pages;
spin_lock(&pagecache_lock);
curr = head->next;
while (curr != head) {
@@ -267,6 +253,25 @@
spin_unlock(&pagecache_lock);
}

+
+/**
+ * truncate_inode_pages - truncate *all* the pages from an offset
+ * @mapping: mapping to truncate
+ * @lstart: offset from with to truncate
+ *
+ * Truncate the page cache at a set offset, removing the pages
+ * that are beyond that offset (and zeroing out partial pages).
+ * If any page is locked we wait for it to become unlocked.
+ */
+void truncate_inode_pages(struct address_space * mapping, loff_t lstart)
+{
+ unsigned long start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
+
+ truncate_list_pages(&mapping->clean_pages, start, partial);
+ truncate_list_pages(&mapping->dirty_pages, start, partial);
+}
+
static inline struct page * __find_page_nolock(struct address_space *mapping, unsigned long offset, struct page *page)
{
goto inside;
@@ -328,14 +333,12 @@
return error;
}

-static int do_buffer_fdatasync(struct inode *inode, unsigned long start, unsigned long end, int (*fn)(struct page *))
+static int do_buffer_fdatasync(struct list_head *head, unsigned long start, unsigned long end, int (*fn)(struct page *))
{
- struct list_head *head, *curr;
+ struct list_head *curr;
struct page *page;
int retval = 0;

- head = &inode->i_mapping->pages;
-
spin_lock(&pagecache_lock);
curr = head->next;
while (curr != head) {
@@ -374,8 +377,18 @@
{
int retval;

- retval = do_buffer_fdatasync(inode, start_idx, end_idx, writeout_one_page);
- retval |= do_buffer_fdatasync(inode, start_idx, end_idx, waitfor_one_page);
+ /* writeout dirty buffers on pages from both clean and dirty lists */
+ retval = do_buffer_fdatasync(&inode->i_mapping->dirty_pages, start_idx,
+ end_idx, writeout_one_page);
+ retval |= do_buffer_fdatasync(&inode->i_mapping->clean_pages, start_idx,
+ end_idx, writeout_one_page);
+
+ /* now wait for locked buffers on pages from both clean and dirty lists */
+ retval |= do_buffer_fdatasync(&inode->i_mapping->dirty_pages, start_idx,
+ end_idx, writeout_one_page);
+ retval |= do_buffer_fdatasync(&inode->i_mapping->clean_pages, start_idx,
+ end_idx, waitfor_one_page);
+
return retval;
}

diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/mm/memory.c linux/mm/memory.c
--- linux.orig/mm/memory.c Mon Dec 11 19:45:42 2000
+++ linux/mm/memory.c Fri Dec 29 16:02:23 2000
@@ -259,22 +259,22 @@
/*
* Return indicates whether a page was freed so caller can adjust rss
*/
-static inline int free_pte(pte_t page)
+static inline int free_pte(pte_t pte)
{
- if (pte_present(page)) {
- struct page *ptpage = pte_page(page);
- if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+ if (pte_present(pte)) {
+ struct page *page = pte_page(pte);
+ if ((!VALID_PAGE(page)) || PageReserved(page))
return 0;
/*
* free_page() used to be able to clear swap cache
* entries. We may now have to do it manually.
*/
- if (pte_dirty(page))
- SetPageDirty(ptpage);
- free_page_and_swap_cache(ptpage);
+ if (pte_dirty(pte) && page->mapping)
+ SetPageDirty(page);
+ free_page_and_swap_cache(page);
return 1;
}
- swap_free(pte_to_swp_entry(page));
+ swap_free(pte_to_swp_entry(pte));
return 0;
}

diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/mm/swap_state.c linux/mm/swap_state.c
--- linux.orig/mm/swap_state.c Fri Dec 29 15:30:55 2000
+++ linux/mm/swap_state.c Fri Dec 29 16:10:15 2000
@@ -29,9 +29,13 @@
};

struct address_space swapper_space = {
- { /* pages */
- &swapper_space.pages, /* .next */
- &swapper_space.pages /* .prev */
+ { /* clean_pages */
+ &swapper_space.clean_pages, /* .next */
+ &swapper_space.clean_pages /* .prev */
+ },
+ { /* dirty_pages */
+ &swapper_space.dirty_pages, /* .next */
+ &swapper_space.dirty_pages /* .prev */
},
0, /* nrpages */
&swap_aops,
diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/include/linux/fs.h linux/include/linux/fs.h
--- linux.orig/include/linux/fs.h Fri Dec 29 15:30:54 2000
+++ linux/include/linux/fs.h Fri Dec 29 15:53:08 2000
@@ -363,8 +363,9 @@
};

struct address_space {
- struct list_head pages; /* list of pages */
- unsigned long nrpages; /* number of pages */
+ struct list_head clean_pages; /* list of clean pages */
+ struct list_head dirty_pages; /* list of dirty pages */
+ unsigned long nrpages; /* number of total pages */
struct address_space_operations *a_ops; /* methods */
void *host; /* owner: inode, block_device */
struct vm_area_struct *i_mmap; /* list of private mappings */
@@ -1064,6 +1065,7 @@
extern int fsync_inode_buffers(struct inode *);
extern int osync_inode_buffers(struct inode *);
extern int inode_has_buffers(struct inode *);
+extern void filemap_fdatasync(struct address_space *);
extern void sync_supers(kdev_t);
extern int bmap(struct inode *, int);
extern int notify_change(struct dentry *, struct iattr *);
diff -Nur --exclude-from=/home/marcelo/exclude linux.orig/include/linux/mm.h linux/include/linux/mm.h
--- linux.orig/include/linux/mm.h Fri Dec 29 15:30:54 2000
+++ linux/include/linux/mm.h Fri Dec 29 16:07:39 2000
@@ -181,7 +181,14 @@
#define SetPageUptodate(page) set_bit(PG_uptodate, &(page)->flags)
#define ClearPageUptodate(page) clear_bit(PG_uptodate, &(page)->flags)
#define PageDirty(page) test_bit(PG_dirty, &(page)->flags)
-#define SetPageDirty(page) set_bit(PG_dirty, &(page)->flags)
+#define SetPageDirty(page) \
+ if (!test_and_set_bit(PG_dirty, &page->flags)) { \
+ spin_lock(&pagecache_lock); \
+ list_del(&page->list); \
+ list_add(&page->list, &page->mapping->dirty_pages); \
+ spin_unlock(&pagecache_lock); \
+ }
+
#define ClearPageDirty(page) clear_bit(PG_dirty, &(page)->flags)
#define PageLocked(page) test_bit(PG_locked, &(page)->flags)
#define LockPage(page) set_bit(PG_locked, &(page)->flags)


2000-12-29 21:44:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] filemap_fdatasync & related changes

In article <[email protected]>,
Marcelo Tosatti <[email protected]> wrote:
>
>Ok, here it is.
>
>I hope I got all locking and all special cases right.
>
>Comments ?

Looks good.

There's a few things this misses, the worst of which were all my bugs in
the original description. Things like "don't unlock the page after
calling writepage, becasue writepage will do it on its own". Details
like that.

The worst problem actually ends up being the fact that the global sync()
doesn't work, after all, because if it's associated with syncing the
inodes it will only sync the dirty page list for inodes that are dirty.
Now, this probably doesn't show up in testing, because mostly if you
have dirty pages the inode too _will_ be dirty, but that only makes the
bug more insidious.

So I'll work on this a bit to polish up these problems, but on the whole
this is all fine.

Linus

2000-12-29 22:11:48

by David Miller

[permalink] [raw]
Subject: Re: test13-pre5

Date: Fri, 29 Dec 2000 15:46:22 +0000 (GMT)
From: Mark Hemment <[email protected]>

For my development testing, I'm running a _heavily_ hacked
kernel. One of these hacks is to pull the wait_queue_head out of
struct page; the waitq-heads are in a separate allocated area of
memory, with a waitq-head pointer embedded in the page structure
(allocated/initialised in free_area_init_core()). This gives a
page structure of 60bytes, giving me one free double-word to play
with (which I'm using as a pointer to a release function).

Not something like those damn Solaris turnstiles, no please....

Later,
David S. Miller
[email protected]

2000-12-29 22:22:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: test13-pre5



On Fri, 29 Dec 2000, David S. Miller wrote:
>
> For my development testing, I'm running a _heavily_ hacked
> kernel. One of these hacks is to pull the wait_queue_head out of
> struct page; the waitq-heads are in a separate allocated area of
> memory, with a waitq-head pointer embedded in the page structure
> (allocated/initialised in free_area_init_core()). This gives a
> page structure of 60bytes, giving me one free double-word to play
> with (which I'm using as a pointer to a release function).
>
> Not something like those damn Solaris turnstiles, no please....

If you want to have a release function, please just use "page->mapping",
which gives you much more, including memory pressure indicators etc. Now
_that_ can be useful for doing things like slab caches.

Linus

2000-12-31 15:44:30

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: test13-pre5

On Thu, 28 Dec 2000, Linus Torvalds wrote:
> On Thu, 28 Dec 2000, Andi Kleen wrote:
> > - Instead of having a zone pointer mask use a 8 or 16 byte index into a
> > zone table. On a modern CPU it is much cheaper to do the and/shifts than
> > to do even a single cache miss during page aging. On a lot of systems
> > that zone index could be hardcoded to 0 anyways, giving better code.
> > - Instead of using 4/8 bytes for the age use only 16bit (FreeBSD which
> > has the same swapping algorithm even only uses 8bit)
>
> This would be good, but can be hard.
>
> FreeBSD doesn't try to be portable any more, but Linux does, and there are
> architectures where 8- and 16-bit accesses aren't atomic but have to be
> done with read-modify-write cycles.
>
> And even for fields like "age", where we don't care whether the age itself
> is 100% accurate, we _do_ care that the fields close-by don't get strange
> effects from updating "age". We used to have exactly this problem on alpha
> back in the 2.1.x timeframe.
>
> This is why a lot of fields are 32-bit, even though we wouldn't need more
> than 8 or 16 bits of them.

What about defining new types for this? Like e.g. `x8', being `u8' on platforms
were that's OK, and `u32' on platforms where that's more efficient?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2000-12-31 17:52:19

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Sat, Dec 30, 2000 at 02:24:06PM +0100, Geert Uytterhoeven wrote:
> On Thu, 28 Dec 2000, Linus Torvalds wrote:
> > On Thu, 28 Dec 2000, Andi Kleen wrote:
> > > - Instead of having a zone pointer mask use a 8 or 16 byte index into a
> > > zone table. On a modern CPU it is much cheaper to do the and/shifts than
> > > to do even a single cache miss during page aging. On a lot of systems
> > > that zone index could be hardcoded to 0 anyways, giving better code.
> > > - Instead of using 4/8 bytes for the age use only 16bit (FreeBSD which
> > > has the same swapping algorithm even only uses 8bit)
> >
> > This would be good, but can be hard.
> >
> > FreeBSD doesn't try to be portable any more, but Linux does, and there are
> > architectures where 8- and 16-bit accesses aren't atomic but have to be
> > done with read-modify-write cycles.
> >
> > And even for fields like "age", where we don't care whether the age itself
> > is 100% accurate, we _do_ care that the fields close-by don't get strange
> > effects from updating "age". We used to have exactly this problem on alpha
> > back in the 2.1.x timeframe.
> >
> > This is why a lot of fields are 32-bit, even though we wouldn't need more
> > than 8 or 16 bits of them.
>
> What about defining new types for this? Like e.g. `x8', being `u8' on platforms
> were that's OK, and `u32' on platforms where that's more efficient?

Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
embedded systems, where you could trade a bit of CPU for less memory overhead
even on systems where u8 is slow and atomicity doesn't come into play
because it's UP anyways.

Only problem I see is that when the programmer was wrong about the possible
range (which sometimes happens) then it could mysteriously work on some
machines and fail on others. This is already the case e.g. with atomic_t,
which is shorter than 32bit e.g. on sparc32, so it is probably not a too
big problem.


-Andi

/* asm/types.h for a random 32bit machine with no byte access */
#if defined(CONFIG_SPACE_EFFICIENT) && !defined(CONFIG_SMP)
typedef __u8 x8;
typedef __u16 x16;
typedef __u32 x32;
#else
typedef __u32 x8;
typedef __u32 x16;
typedef __u32 x32;
#endif

-Andi

2000-12-31 17:58:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: test13-pre5



On Sun, 31 Dec 2000, Andi Kleen wrote:
>
> Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
> embedded systems, where you could trade a bit of CPU for less memory overhead
> even on systems where u8 is slow and atomicity doesn't come into play
> because it's UP anyways.

UP has nothing to do with it.

The alpha systems I remember this problem on were all SMP.

Imagine an architecture where you need to do a

load_32()
mask-and-insert-byte
store_32()

and imagine that an interrupt comes in:

load_32()
mask-and-insert-byte

* INTERRUPT *

load_32()
mask-and-insert-ANOTHER-byte
store_32()

interrupt return

store_32()

and notice how the value written by the interrupt is gone, gone, gone,
even though it was to a completely different byte.

Now, imagine that the first byte is the "age", and imagine that the thing
the interrupt tries to update is "flags".

Yes, you're screwed.

I don't think it's a good diea.

Linus

2000-12-31 18:07:43

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Sun, Dec 31, 2000 at 09:27:23AM -0800, Linus Torvalds wrote:
>
>
> On Sun, 31 Dec 2000, Andi Kleen wrote:
> >
> > Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
> > embedded systems, where you could trade a bit of CPU for less memory overhead
> > even on systems where u8 is slow and atomicity doesn't come into play
> > because it's UP anyways.
>
> UP has nothing to do with it.
>
> The alpha systems I remember this problem on were all SMP.

AFAIK alpha has byte instructions now.


-Andi

2000-12-31 18:36:50

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Sun, Dec 31, 2000 at 09:27:23AM -0800, Linus Torvalds wrote:
>
>
> On Sun, 31 Dec 2000, Andi Kleen wrote:
> >
> > Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
> > embedded systems, where you could trade a bit of CPU for less memory overhead
> > even on systems where u8 is slow and atomicity doesn't come into play
> > because it's UP anyways.
>
> UP has nothing to do with it.
>
> The alpha systems I remember this problem on were all SMP.
[...]

I just checked all architecture manuals I could lay my hands on
(sparcv9, ppc32, mips r4400, parisc 1.1, alpha, sh is somewhere in
storage but as I remember it has it too)
and they all seem to have at least store byte and mostly store
half words instructions.

>
> Imagine an architecture where you need to do a
>
> load_32()
> mask-and-insert-byte
> store_32()

iirc the Alpha guys found out that they couldn't drive half of the
available devices without byte store, and since then nobody has
repeated that mistake @)


>
> I don't think it's a good diea.

I don't see it. Just define x8 to u32 on old alpha and let most other architectures
be happy.


-Andi

2000-12-31 18:38:31

by Matti Aarnio

[permalink] [raw]
Subject: Re: test13-pre5

On Sun, Dec 31, 2000 at 09:27:23AM -0800, Linus Torvalds wrote:
> On Sun, 31 Dec 2000, Andi Kleen wrote:
> >
> > Sounds good. It could also be controlled by a CONFIG_SPACE_EFFICIENT for
> > embedded systems, where you could trade a bit of CPU for less memory overhead
> > even on systems where u8 is slow and atomicity doesn't come into play
> > because it's UP anyways.
>
> UP has nothing to do with it.
> The alpha systems I remember this problem on were all SMP.

Actually nothing SMP specific in that problem sphere.
Alpha has load-locked/store-conditional pair for
this type of memory accesses to automatically detect,
and (conditionally) restart the operation - to form
classical ``locked-read-modify-write'' operations.

In what situations the compiler will use those instructions,
that I don't know. Volatiles, very least, use them.
Will closely packed bytes be processed with it without
them being volatiles ? How about bitfields ?

Newer Alphas have byte/short load/store instructions,
so things really aren't that straight-forward...

....
> I don't think it's a good diea.
>
> Linus

/Matti Aarnio

2000-12-31 18:41:51

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: test13-pre5

On Sun, Dec 31, 2000 at 09:27:23AM -0800, Linus Torvalds wrote:
> The alpha systems I remember this problem on were all [..]

Yes the granularity issue has nothing to do with SMP (with preemptive kernel
it can trigger even without interrupts involved into the code). Also
CONFIG_SPACE_EFFICIENT looks not necessary.

The x8 name is confusing IMHO (when I read `8' I expect 8bits only, the x
isn't explicit enough). But by using a better name we could save some byte on
alpha ev6 and x86. Something like granular_char/granular_short/granular_int
looks nicer. For the generic 64bit cpu they needs to be _unconditionally_
defined to `long'.

BTW, only old chips (ev[45]) doesn't provide byte granularity. Infact a linux
kernel compiled for ev6 can handle byte granularity also on alpha (it uses
-mcpu=ev6).

alpha reference manual 5.2.2:

[..] For each region, an implementation must support aligned quadword
access and may optionally support aligned longword access or byte
access. If byte access is supported in a region, aligned word access
and aligned longword access are also supported. [..]

21264hrm:

[..] The 21264-generated external references to memory space are
always of a fixed 64-byte size, though the internal access granularity
is byte, word, longword, or quad-word. [..]

Andrea

2000-12-31 18:50:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: test13-pre5

On Sun, Dec 31, 2000 at 06:36:50PM +0100, Andi Kleen wrote:
> AFAIK alpha has byte instructions now.

See other post. Only from ev6 (at least as far as gcc is concerned). I've an
userspace testcase here (it was originally an obscure alpha userspace MM
corruption bug report that I sorted out some time ago) that works only only
when compiled for ev6 because it needs `short' granularity (not even byte
granularity).

Andrea

2000-12-31 19:46:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: test13-pre5

In article <[email protected]>,
Matti Aarnio <[email protected]> wrote:
>
> Actually nothing SMP specific in that problem sphere.
> Alpha has load-locked/store-conditional pair for
> this type of memory accesses to automatically detect,
> and (conditionally) restart the operation - to form
> classical ``locked-read-modify-write'' operations.

Sure, we could make the older alphas use ldl_l stl_c for byte accesses,
but if you thought byte accesses on those machines were kind-of slow
before, just WAIT until that happens.

Old alpha machines (the same ones that would need this code) were
HORRIBLE at ldl_l<->stl_c: they go out all the way to the bus to set the
lock. So suddenly your every byte access ends up being a few hundred
cycles!

So ldl_l/stc_l is not the answer. It would work, but it would be so
slow that you'd be a lot better off not doing it.

I think they fixed ldl/stc later on (so that it only sets a bit locally
that gets cleared by the cache coherency protocol), but as later alphas
have the byte accesses anyway that doesn't matter here. The faster
ldl/stc makes for much faster spinlocks on newer alphas, though.

Linus

2000-12-31 20:20:49

by Andi Kleen

[permalink] [raw]
Subject: Re: test13-pre5

On Sun, Dec 31, 2000 at 11:15:51AM -0800, Linus Torvalds wrote:
> In article <[email protected]>,
> Matti Aarnio <[email protected]> wrote:
> >
> > Actually nothing SMP specific in that problem sphere.
> > Alpha has load-locked/store-conditional pair for
> > this type of memory accesses to automatically detect,
> > and (conditionally) restart the operation - to form
> > classical ``locked-read-modify-write'' operations.
>
> Sure, we could make the older alphas use ldl_l stl_c for byte accesses,
> but if you thought byte accesses on those machines were kind-of slow
> before, just WAIT until that happens.

The older Alphas would just typedef x8/x16 (or granular_u8, granular_u16
or whatever it is called) to u32 and be the same as today. Just most
other boxes would benefit.

This actually all assumes that gcc really uses the byte instructions
for byte stores in structures, which is to be determined.

-Andi