2003-02-10 10:42:44

by David Howells

Subject: extra PG_* bits for page->flags


Hi Linus,

I'm writing a cache filesystem primarily for caching AFS pages, but one that
can also be used for caching other network FS pages (such as NFSv4, which Jeff
Garzik and Trond Myklebust are interested in, I think).

This will be journalled because if you've got an enormous cache with all of
/usr and all your data in it, you don't want to have to re-initialise it if
power goes away (besides, it may have uncommitted data in it).

So what I'd like is for page->flags to acquire two new flags:

(*) PG_journal

This indicates that a journalled operation is in progress on the page. It is
only cleared when the operation has been marked as completed in the journal
on disc.

(*) PG_journalmark

This indicates that the above journalling operation is still having its
initial mark made in the journal on disc. It is cleared when the write
for this completes.

Anyone wishing to do a journalled operation to a block in the cachefs would:

(1) Wait for the PG_journal to become clear
(2) Ask for the journal to be marked (which would set both bits)
(3) Wait for PG_journalmark to be cleared
(4) Modify the page appropriately and write it to disc
(5) Ask for the journal to be ACK'd

Actually, step (1) would be done by step (2) in a manner similar to
lock_page().
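
The five steps above can be sketched as a small userspace model. This is only an illustration under stated assumptions: C11 atomics stand in for the kernel bit operations, "waiting" is reduced to a try-style call that reports failure instead of sleeping, and the names `struct model_page`, `try_begin_journal_op`, `mark_written`, `ack_written` and `may_modify` are hypothetical and do not appear in the patch. Only the two bit numbers come from the diff below.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers as proposed in the patch; everything else is a model. */
#define PG_journalmark 20
#define PG_journal     21

#define JOURNAL_MASK (1UL << PG_journal)
#define MARK_MASK    (1UL << PG_journalmark)

/* Hypothetical stand-in for struct page: just a flags word. */
struct model_page {
    atomic_ulong flags;
};

/*
 * Steps (1) and (2): claim the page for a journalled operation and note
 * that its initial mark is being written. Both bits are set in a single
 * atomic update. Returns false if another operation already holds
 * PG_journal; the kernel version would sleep at that point instead.
 */
static bool try_begin_journal_op(struct model_page *p)
{
    unsigned long old = atomic_load(&p->flags);

    do {
        if (old & JOURNAL_MASK)
            return false;
    } while (!atomic_compare_exchange_weak(&p->flags, &old,
                                           old | JOURNAL_MASK | MARK_MASK));
    return true;
}

/* The initial mark has reached the journal: step (3) can finish. */
static void mark_written(struct model_page *p)
{
    unsigned long old = atomic_fetch_and(&p->flags, ~MARK_MASK);

    assert(old & MARK_MASK);    /* mirrors the BUG() check in the patch */
    (void)old;
}

/* Step (5) has completed: the ACK has reached the journal. */
static void ack_written(struct model_page *p)
{
    unsigned long old = atomic_fetch_and(&p->flags, ~JOURNAL_MASK);

    assert(old & JOURNAL_MASK);
    (void)old;
}

/* Step (4) is allowed only while the op is held and the mark is on disc. */
static bool may_modify(struct model_page *p)
{
    unsigned long f = atomic_load(&p->flags);

    return (f & JOURNAL_MASK) && !(f & MARK_MASK);
}
```

In this model a caller would call `try_begin_journal_op()`, wait until `mark_written()` has run, modify and write the page, and then wait for `ack_written()` before a second operation can claim the page.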


Would you be willing to accept such an addition?

David

diff -ur linux-2.5.59-orig/include/linux/page-flags.h linux-2.5.59/include/linux/page-flags.h
--- linux-2.5.59-orig/include/linux/page-flags.h 2003-01-17 02:22:24.000000000 +0000
+++ linux-2.5.59/include/linux/page-flags.h 2003-02-09 18:40:03.000000000 +0000
@@ -74,6 +74,9 @@
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be recalimed asap */

+#define PG_journalmark 20 /* journalling mark being written for page */
+#define PG_journal 21 /* journalled operation in progress on page */
+
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
* allowed.
@@ -185,6 +188,18 @@
#define PageChecked(page) test_bit(PG_checked, &(page)->flags)
#define SetPageChecked(page) set_bit(PG_checked, &(page)->flags)

+#define PageJournal(page) test_bit(PG_journal, &(page)->flags)
+#define SetPageJournal(page) set_bit(PG_journal, &(page)->flags)
+#define ClearPageJournal(page) clear_bit(PG_journal, &(page)->flags)
+#define TestClearPageJournal(page) test_and_clear_bit(PG_journal, &(page)->flags)
+#define TestSetPageJournal(page) test_and_set_bit(PG_journal, &(page)->flags)
+
+#define PageJournalMark(page) test_bit(PG_journalmark, &(page)->flags)
+#define SetPageJournalMark(page) set_bit(PG_journalmark, &(page)->flags)
+#define ClearPageJournalMark(page) clear_bit(PG_journalmark, &(page)->flags)
+#define TestClearPageJournalMark(page) test_and_clear_bit(PG_journalmark, &(page)->flags)
+#define TestSetPageJournalMark(page) test_and_set_bit(PG_journalmark, &(page)->flags)
+
#define PageReserved(page) test_bit(PG_reserved, &(page)->flags)
#define SetPageReserved(page) set_bit(PG_reserved, &(page)->flags)
#define ClearPageReserved(page) clear_bit(PG_reserved, &(page)->flags)
diff -ur linux-2.5.59-orig/include/linux/pagemap.h linux-2.5.59/include/linux/pagemap.h
--- linux-2.5.59-orig/include/linux/pagemap.h 2003-01-17 02:21:34.000000000 +0000
+++ linux-2.5.59/include/linux/pagemap.h 2003-02-09 19:34:48.000000000 +0000
@@ -122,4 +122,33 @@
}

extern void end_page_writeback(struct page *page);
+
+/*
+ * Wait for journalling on a page
+ * - first bit for initiation mark being recorded in journal
+ * - second bit for ACK mark being recorded in journal
+ */
+extern void FASTCALL(__journal_page(struct page *page));
+
+static inline void journal_page(struct page *page)
+{
+ if (TestSetPageJournal(page))
+ __journal_page(page);
+}
+
+static inline void wait_on_page_journal_mark(struct page *page)
+{
+ if (PageJournalMark(page))
+ wait_on_page_bit(page, PG_journalmark);
+}
+
+static inline void wait_on_page_journal_ack(struct page *page)
+{
+ if (PageJournal(page))
+ wait_on_page_bit(page, PG_journal);
+}
+
+extern void FASTCALL(page_marked_journal(struct page *page));
+extern void FASTCALL(page_acked_journal(struct page *page));
+
#endif /* _LINUX_PAGEMAP_H */
diff -ur linux-2.5.59-orig/mm/filemap.c linux-2.5.59/mm/filemap.c
--- linux-2.5.59-orig/mm/filemap.c 2003-01-17 02:22:00.000000000 +0000
+++ linux-2.5.59/mm/filemap.c 2003-02-09 19:35:36.000000000 +0000
@@ -226,6 +226,8 @@
return error;
}

+EXPORT_SYMBOL(add_to_page_cache);
+
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
pgoff_t offset, int gfp_mask)
{
@@ -312,6 +314,36 @@
EXPORT_SYMBOL(end_page_writeback);

/*
+ * Note a journalling op on a page has been ACK'd in the journal
+ */
+void page_acked_journal(struct page *page)
+{
+ wait_queue_head_t *waitqueue = page_waitqueue(page);
+ smp_mb__before_clear_bit();
+ if (!TestClearPageJournal(page))
+ BUG();
+ smp_mb__after_clear_bit();
+ if (waitqueue_active(waitqueue))
+ wake_up_all(waitqueue);
+}
+EXPORT_SYMBOL(page_acked_journal);
+
+/*
+ * Note a journalling op on a page has been initiated in the journal
+ */
+void page_marked_journal(struct page *page)
+{
+ wait_queue_head_t *waitqueue = page_waitqueue(page);
+ smp_mb__before_clear_bit();
+ if (!TestClearPageJournalMark(page))
+ BUG();
+ smp_mb__after_clear_bit();
+ if (waitqueue_active(waitqueue))
+ wake_up_all(waitqueue);
+}
+EXPORT_SYMBOL(page_marked_journal);
+
+/*
* Get a lock on the page, assuming we need to sleep to get it.
*
* Ugly: running sync_page() in state TASK_UNINTERRUPTIBLE is scary. If some
@@ -335,6 +367,29 @@
EXPORT_SYMBOL(__lock_page);

/*
+ * Get a journalling lock on the page, assuming we need to sleep to get it.
+ *
+ * Ugly: running sync_page() in state TASK_UNINTERRUPTIBLE is scary. If some
+ * random driver's requestfn sets TASK_RUNNING, we could busywait. However
+ * chances are that on the second loop, the block layer's plug list is empty,
+ * so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
+ */
+void __journal_page(struct page *page)
+{
+ wait_queue_head_t *wqh = page_waitqueue(page);
+ DEFINE_WAIT(wait);
+
+ while (TestSetPageJournal(page)) {
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ sync_page(page);
+ if (PageJournal(page))
+ io_schedule();
+ }
+ finish_wait(wqh, &wait);
+}
+EXPORT_SYMBOL(__journal_page);
+
+/*
* a rather lightweight function, finding and getting a reference to a
* hashed page atomically.
*/
diff -ur linux-2.5.59-orig/mm/page_alloc.c linux-2.5.59/mm/page_alloc.c
--- linux-2.5.59-orig/mm/page_alloc.c 2003-01-17 02:21:38.000000000 +0000
+++ linux-2.5.59/mm/page_alloc.c 2003-02-09 17:10:05.000000000 +0000
@@ -262,7 +262,9 @@
1 << PG_active |
1 << PG_dirty |
1 << PG_reclaim |
- 1 << PG_writeback )))
+ 1 << PG_writeback |
+ 1 << PG_journal |
+ 1 << PG_journalmark)))
bad_page(__FUNCTION__, page);

page->flags &= ~(1 << PG_uptodate | 1 << PG_error |


2003-02-10 23:03:56

by Andrew Morton

Subject: Re: extra PG_* bits for page->flags

David Howells wrote:
>
>
> Hi Linus,
>
> I'm writing a cache filesystem primarily for caching AFS pages, but one that
> can also be used for caching other network FS pages (such as NFSv4, which Jeff
> Garzik and Trond Myklebust are interested in, I think).

Is a new fs needed? Is it not possible to use an existing filesystem of the
user's choice for local caching?

> ...
>
> (*) PG_journal
> (*) PG_journalmark

Well. If your new fs goes in then yes, we can spare those bits (just). If
possible, consider making them "opaque address_space-private", so this prime
real estate can be shared with other filesystems (eg, PG_checked).

2003-02-10 23:56:43

by Nigel Cunningham

Subject: Re: extra PG_* bits for page->flags

On Tue, 2003-02-11 at 12:12, Andrew Morton wrote:
> David Howells wrote:
> > (*) PG_journal
> > (*) PG_journalmark
>
> Well. If your new fs goes in then yes, we can spare those bits (just).

May I ask, how many bits do you consider available? swsusp beta 18 (ie
2.4), which I'm beginning to port to 2.5, uses 4 bits during suspend &
resume for various purposes. If I understand the code correctly, the
zone flags use bits 24-31 (although there has been that thread saying
they could use less bits). I see in the 2.5.60 patch bit 19 is now in
use. Should I be using private, temporarily allocated bitmaps instead of
the page flags, to ease the pressure? (Especially since the suspend code
is not used in 'normal' operation anyway).
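
The alternative raised here, a private bitmap allocated only for the duration of suspend, could look roughly like the following. It is a userspace sketch, not swsusp code: `calloc()` stands in for a kernel allocator, pages are identified by a plain index, and the names `struct pfn_bitmap` and `pfn_bitmap_*` are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One bit per page frame, kept outside page->flags. */
struct pfn_bitmap {
    size_t nr_pages;
    unsigned long *words;
};

/* Allocate a zeroed bitmap covering nr_pages frames; 0 on success. */
static int pfn_bitmap_alloc(struct pfn_bitmap *bm, size_t nr_pages)
{
    size_t nwords = (nr_pages + BITS_PER_WORD - 1) / BITS_PER_WORD;

    bm->words = calloc(nwords ? nwords : 1, sizeof(unsigned long));
    if (!bm->words)
        return -1;
    bm->nr_pages = nr_pages;
    return 0;
}

/* Release the bitmap once suspend or resume has finished. */
static void pfn_bitmap_free(struct pfn_bitmap *bm)
{
    free(bm->words);
    bm->words = NULL;
    bm->nr_pages = 0;
}

static void pfn_bitmap_set(struct pfn_bitmap *bm, size_t pfn)
{
    assert(pfn < bm->nr_pages);
    bm->words[pfn / BITS_PER_WORD] |= 1UL << (pfn % BITS_PER_WORD);
}

static void pfn_bitmap_clear(struct pfn_bitmap *bm, size_t pfn)
{
    assert(pfn < bm->nr_pages);
    bm->words[pfn / BITS_PER_WORD] &= ~(1UL << (pfn % BITS_PER_WORD));
}

static bool pfn_bitmap_test(const struct pfn_bitmap *bm, size_t pfn)
{
    assert(pfn < bm->nr_pages);
    return (bm->words[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1UL;
}
```

Each of the four per-page markers would get its own bitmap, so the cost is four bits of temporary memory per page frame and no permanent claim on page->flags.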

Regards,

Nigel


2003-02-11 00:11:31

by Andrew Morton

Subject: Re: extra PG_* bits for page->flags

Nigel Cunningham wrote:
>
> On Tue, 2003-02-11 at 12:12, Andrew Morton wrote:
> > David Howells wrote:
> > > (*) PG_journal
> > > (*) PG_journalmark
> >
> > Well. If your new fs goes in then yes, we can spare those bits (just).
>
> May I ask, how many bits do you consider available?

Too darn few, frankly.

> swsusp beta 18 (ie
> 2.4), which I'm beginning to port to 2.5, uses 4 bits during suspend &
> resume for various purposes. If I understand the code correctly, the
> zone flags use bits 24-31 (although there has been that thread saying
> they could use less bits). I see in the 2.5.60 patch bit 19 is now in
> use. Should I be using private, temporarily allocated bitmaps instead of
> the page flags, to ease the pressure? (Especially since the suspend code
> is not used in 'normal' operation anyway).

256 zones is fairly exorbitant. I suspect the number of machines which have

a) more than 16 zones and
b) CONFIG_SOFTWARE_SUSPEND

is zero. So you can always munch into the top eight bits.
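
The arithmetic behind this can be sketched as follows, assuming a 32-bit flags word with the zone index stored in the topmost bits (the thread puts it in bits 24-31). The helper names `zone_shift`, `set_zone` and `get_zone` are hypothetical and are not the kernel's own accessors.

```c
#include <assert.h>
#include <stdint.h>

#define FLAGS_WIDTH 32u

/* Lowest bit used by the zone field when it occupies the top zone_bits. */
static unsigned zone_shift(unsigned zone_bits)
{
    assert(zone_bits >= 1 && zone_bits < FLAGS_WIDTH);
    return FLAGS_WIDTH - zone_bits;
}

/* Store a zone index in the top zone_bits, leaving the low bits alone. */
static uint32_t set_zone(uint32_t flags, unsigned zone, unsigned zone_bits)
{
    unsigned shift = zone_shift(zone_bits);
    uint32_t low_mask = ((uint32_t)1 << shift) - 1;

    assert(zone < (1u << zone_bits));
    return (flags & low_mask) | ((uint32_t)zone << shift);
}

/* Read the zone index back out of the top zone_bits. */
static unsigned get_zone(uint32_t flags, unsigned zone_bits)
{
    return flags >> zone_shift(zone_bits);
}
```

With an 8-bit field `zone_shift(8)` is 24, so bits 24-31 are taken and up to 256 zones can be encoded. With a 4-bit field `zone_shift(4)` is 28, which still covers 16 zones and leaves bits 24-27 free for other flags.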

PG_checked is supposed to be removed - I have not looked into that. PG_slab
is fairly optional.

PG_highmem can go away. (just use page_zone(page)->is_highmem)

I would dearly like to dump PG_reserved, but I doubt if I'll get onto that.
(thinks about what happens if you start a direct-io read from a soundcard DMA
buffer, and munmap/close while that is in progress...)

So. There's not a lot of fat there, but we're not all out of options.


2003-02-11 02:06:34

by Nigel Cunningham

Subject: Re: extra PG_* bits for page->flags

On Tue, 2003-02-11 at 13:20, Andrew Morton wrote:
> 256 zones is fairly exorbitant. I suspect the number of machines which have
>
> a) more than 16 zones and
> b) CONFIG_SOFTWARE_SUSPEND
>
> is zero. So you can always munch into the top eight bits.

So bits <= 27 should be safe or are guaranteed safe? :> (Keep in mind
that people are starting to port swsusp to other archs)

> PG_checked is supposed to be removed - I have not looked into that. PG_slab
> is fairly optional.
AFAICS, PG_checked is cleared in page_alloc.c and only otherwise used in
fs/ext2/dir.c, freexfs/xvfs_subr.c (commented out actually) and
afs/dir.c.

For PG_slab, should I understand 'fairly optional' to mean 'possibly
dangerous' (ie don't use)?

> PG_highmem can go away. (just use page_zone(page)->is_highmem)

> I would dearly like to dump PG_reserved, but I doubt if I'll get onto that.
> (thinks about what happens if you start a direct-io read from a soundcard DMA
> buffer, and munmap/close while that is in progress...)
>
> So. There's not a lot of fat there, but we're not all out of options.

I can still happily do a few dynamically allocated bitmaps if there are
better things the bits can be used for. (The flags are not costly to redo
and need to be redone on each suspend & resume anyway).

Of course, on top of all of these questions, Pavel might not like my
code anyway so this might be a moot point :>

Anyway, for the moment, I'll proceed on the assumption that there'll be
enough flags, and I can always switch if needs be.

Regards,

Nigel

2003-02-11 04:08:47

by Valdis Klētnieks

Subject: Re: extra PG_* bits for page->flags

On Mon, 10 Feb 2003 15:12:44 PST, Andrew Morton said:

> Is a new fs needed? Is it not possible to use an existing filesystem of the
> user's choice for local caching?

It's sort of like a loopback mount - you need a backing store which could be
on an ext3 or ramfs or whatever, and you need the user-visible front end side
of things, which manages the backing store and fetches blocks from AFS/NFS/etc.



2003-02-11 10:45:55

by David Howells

Subject: Re: extra PG_* bits for page->flags


Andrew Morton wrote:

> > I'm writing a cache filesystem primarily for caching AFS pages, but one
> > that can also be used for caching other network FS pages (such as NFSv4,
> > which Jeff Garzik and Trond Myklebust are interested in, I think).
>
> Is a new fs needed? Is it not possible to use an existing filesystem of the
> user's choice for local caching?

The problem with that is who owns the pages being served? With an existing
filesystem either the pages have to belong to the cache FS (and not the net
FS) or double buffering has to be incurred.

What my FS strives to do is make it so the pages are owned by the net FS, the
cache FS just provides operations to perform reads/writes from/to disc to any
page (plus journalling).

The cache FS then only owns the pages which hold its metadata.

Also - though there may be ways around this according to Stephen Tweedie - the
handling of holes in files is different. In my cache FS, a hole is a page that
we don't have from the server yet. In an existing FS, a hole is a page of
zeros that simply doesn't reside on disc.
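
The difference in hole handling can be put as a tiny model. This is a sketch only: the names `struct block_slot`, `enum read_action`, `lookup_regular_fs` and `lookup_cache_fs` are hypothetical, and a single flag stands in for whatever on-disc metadata records that a block is present.

```c
#include <stdbool.h>

/* What a read of one page-sized block should do. */
enum read_action {
    READ_FROM_DISC,     /* block is present: read it */
    READ_AS_ZEROS,      /* hole in an ordinary file: supply zeros */
    FETCH_FROM_SERVER   /* hole in the cache: data not yet fetched */
};

/* Hypothetical per-block metadata: is there a block on disc? */
struct block_slot {
    bool on_disc;
};

/* Ordinary filesystem: an unallocated block reads back as zeros. */
static enum read_action lookup_regular_fs(const struct block_slot *slot)
{
    return slot->on_disc ? READ_FROM_DISC : READ_AS_ZEROS;
}

/* Cache filesystem: an unallocated block means "ask the server". */
static enum read_action lookup_cache_fs(const struct block_slot *slot)
{
    return slot->on_disc ? READ_FROM_DISC : FETCH_FROM_SERVER;
}
```

The same on-disc state (no block allocated) leads to two different answers, which is why a cache layered on an existing filesystem cannot tell a page it has never fetched from a page that is genuinely all zeros without extra bookkeeping.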

David

2003-02-11 10:57:52

by David Howells

Subject: Re: extra PG_* bits for page->flags


Valdis Klētnieks wrote:
> > Is a new fs needed? Is it not possible to use an existing filesystem of
> > the user's choice for local caching?
>
> It's sort of like a loopback mount - you need a backing store which could be
> on an ext3 or ramfs or whatever, and you need the user-visible front end
> side of things, which manages the backing store and fetches blocks from
> AFS/NFS/etc.

Sort of, yes. Making it a filesystem is more a way of making it possible for
the kernel to associate a block device directly with the caching code. Rather
than adding an additional kernel interface beyond mount/swapon, you can just
use mount to add a block device to the cache.

Furthermore, it means that I can provide virtual files along the lines of
/proc in the mountpoint that allow you to access status information, and maybe
also control the data cached.

David