LinuxLists.cc - tmem and swap pagecache

2012-06-20 23:11:54

Subject: tmem and swap pagecache

DragonFly BSD has a feature (swapcache) by which, if you put a swap
partition on an SSD, you can configure the swap area to also accept
page cache data. In this way, page cache for data that lives on
spinning hard disks can be swapped out to the SSD, which then serves as
an accelerator for the slower disks.

It seems to me that a good use of tmem[1] would be a module that either
swaps page cache out to such a device or supplies a swap area with page
cache logic (i.e. a combined area that uses its entire size as page
cache, but evicts page cache to make room for swap when it is full).
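
To make the "combined area" idea concrete, here is a minimal sketch in
Python. It is a toy model and not anything tmem provides: the class
CombinedArea and its methods are names I made up for illustration. It
assumes swap pages must never be dropped, while page cache pages are
clean and can be thrown away oldest-first.

```python
from collections import OrderedDict


class CombinedArea:
    """Toy model of a fixed-size area shared by swap and page cache.

    Hypothetical illustration only. Swap pages are pinned; page cache
    pages fill whatever room is left and are dropped oldest-first.
    """

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.swap = {}              # key -> data, must never be dropped
        self.cache = OrderedDict()  # key -> data, oldest entry first

    def used(self):
        return len(self.swap) + len(self.cache)

    def put_cache(self, key, data):
        """Store a clean page cache page; False means it was declined."""
        self.cache.pop(key, None)   # re-inserting refreshes the page's age
        if len(self.swap) >= self.capacity:
            return False            # swap owns the whole area
        while self.used() >= self.capacity:
            self.cache.popitem(last=False)   # drop the oldest cache page
        self.cache[key] = data
        return True

    def put_swap(self, key, data):
        """Store a swap page, evicting page cache if room is needed."""
        if key not in self.swap:
            if len(self.swap) >= self.capacity:
                return False        # genuinely out of swap
            while self.used() >= self.capacity:
                self.cache.popitem(last=False)
        self.swap[key] = data
        return True
```

With a three-page area holding three cache pages, the first put_swap()
pushes out the oldest cache page; once three swap pages are stored,
further put_cache() calls are declined.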

It isn't apparent to me that tmem has any way of knowing where a page
comes from, however. Specifically, as far as I can tell (I may be
wrong), you can't hint to tmem that a page is from page cache (you can
tell it it's persistent or transient--probably swap or cache, but not
necessarily), much less that it's page cache backed by a spinning disk
versus an SSD or USB drive.
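
Roughly what I have in mind, as a sketch in Python rather than kernel
code. Everything here is hypothetical: OriginHint, PageKind,
DeviceClass and the extra "hint" argument are my inventions, not part
of tmem. The only thing borrowed from tmem is the general shape of a
put/get keyed by pool, object and index. The backend is free to ignore
the hint entirely.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PageKind(Enum):
    SWAP = "swap"
    PAGE_CACHE = "page_cache"


class DeviceClass(Enum):
    SLOW = "slow"   # e.g. spinning disk, USB drive
    FAST = "fast"   # e.g. SSD


@dataclass(frozen=True)
class OriginHint:
    """Hypothetical hint describing where a page came from."""
    kind: PageKind
    backing: Optional[DeviceClass] = None   # None: unknown or not applicable


class SsdBackend:
    """Toy backend that stores pages on an (imaginary) SSD."""

    def __init__(self):
        self.pages = {}

    def put(self, pool_id, object_id, index, data, hint=None):
        """Return True if the page was accepted, False if declined."""
        # Caching a page on an SSD gains nothing if it already lives on
        # a fast device, so this backend chooses to decline it.
        if (hint is not None
                and hint.kind is PageKind.PAGE_CACHE
                and hint.backing is DeviceClass.FAST):
            return False
        self.pages[(pool_id, object_id, index)] = data
        return True

    def get(self, pool_id, object_id, index):
        """Return the stored page, or None if it is not present."""
        return self.pages.get((pool_id, object_id, index))
```

A put with no hint behaves exactly as it would today, so callers that
know nothing about the page's origin are unaffected.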

To my mind, it would be useful to be able to have a database server or
large file server with 4GB or 8GB of RAM and a 128GB SSD ($150, cheaper
than 128GB of RAM...) that can back an extended page cache. Effectively
you get an L2 page cache, with maybe 2-3GB of L1 page cache in system
RAM and 108GB of L2 page cache on the SSD. The other 20GB is in use for
the / file system (which you specifically do NOT want to page cache
onto the SSD, since it already lives there).
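
A toy sketch of the L1/L2 arrangement, in Python, under my own
assumptions: both tiers are plain LRU, a page lives in only one tier at
a time, pages falling out of L1 are demoted to L2, and an L2 hit
promotes the page back to L1. TwoTierCache and the "l2_excluded"
parameter are hypothetical names; the exclusion set models "do not put
the SSD's own file system into the SSD cache".

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy two-level page cache: L1 stands for RAM, L2 for the SSD."""

    def __init__(self, l1_pages, l2_pages, l2_excluded=()):
        self.l1 = OrderedDict()   # (device, block) -> data, oldest first
        self.l2 = OrderedDict()
        self.l1_pages = l1_pages
        self.l2_pages = l2_pages
        self.l2_excluded = set(l2_excluded)   # devices never cached in L2

    def insert(self, device, block, data):
        key = (device, block)
        self.l2.pop(key, None)    # keep the page in one tier only
        self.l1[key] = data
        self.l1.move_to_end(key)  # mark as most recently used
        while len(self.l1) > self.l1_pages:
            old_key, old_data = self.l1.popitem(last=False)
            self._demote(old_key, old_data)

    def _demote(self, key, data):
        if key[0] in self.l2_excluded or self.l2_pages == 0:
            return                # simply drop the page
        self.l2[key] = data
        self.l2.move_to_end(key)
        while len(self.l2) > self.l2_pages:
            self.l2.popitem(last=False)

    def lookup(self, device, block):
        """Return cached data, or None if the backing device must be read."""
        key = (device, block)
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key]
        if key in self.l2:
            data = self.l2.pop(key)
            self.insert(device, block, data)   # promote back into L1
            return data
        return None
```

Sizes are counted in pages to keep the model small; the 2-3GB and 108GB
figures above would just be larger values of l1_pages and l2_pages.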

[Tangent]

It may alternately be a good idea to check the backing device of page
cache in general and favor evicting page cache backed by an SSD over
page cache backed by something slow. That is getting too specific,
though: we would want to call these devices generically "fast" or
"slow", or more precisely "faster than X" and "slower than Y", if we're
discussing that.

Essentially I mean: what about spinning hard disks on SATA2 vs SATA3 vs
USB vs eSATA? eSATA is way faster than USB, and SATA3 is faster than
SATA2 unless the disk on the SATA3 link is itself slower than the disk
on the SATA2 link and neither can saturate SATA2 anyway. Obviously you
want to evict the cache for a 7200RPM SATA3 hard disk with 64MB cache
more readily than the cache for a spinning USB 1.0 hard disk, if they
were both used roughly as recently. Say the SATA3 drive's data is 10%
more likely to be needed first, but the USB drive takes 500% as long to
access: overall you're likely to come out faster favoring the USB
drive's page cache. As you evict more of the SATA3 drive's cache,
what's left has been used far more recently than anything on the USB
drive, so it starts making more sense to evict the old USB drive's
data.
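
The trade-off can be put as a back-of-envelope rule: evict the page
with the lowest expected cost of having to read it again, i.e. (chance
it is needed again) times (cost of re-reading it from its device). The
Python sketch below uses numbers I made up to match the example: reuse
chances of 0.11 vs 0.10 (the SATA3 page is 10% more likely to be
needed) and relative read costs of 1 vs 5. The function names are
hypothetical and this is not how any kernel's reclaim actually works.

```python
def expected_refetch_cost(reuse_probability, refetch_cost):
    """Expected penalty of evicting a page that may be needed again."""
    return reuse_probability * refetch_cost


def pick_victim(pages):
    """Pick the page whose eviction is expected to cost the least."""
    return min(
        pages,
        key=lambda p: expected_refetch_cost(p["reuse"], p["cost"]),
    )


# Both pages used about equally recently: the SATA3 page is slightly
# more likely to be needed, but the USB page costs 5x as much to re-read.
sata_page = {"name": "sata3", "reuse": 0.11, "cost": 1.0}
usb_page = {"name": "usb", "reuse": 0.10, "cost": 5.0}
first_victim = pick_victim([sata_page, usb_page])    # the SATA3 page

# Later, the surviving SATA3 pages are much hotter than the USB page,
# so the balance tips the other way.
hot_sata_page = {"name": "sata3", "reuse": 0.60, "cost": 1.0}
later_victim = pick_victim([hot_sata_page, usb_page])   # the USB page
```

In the first case the expected costs are 0.11 vs 0.50, so the SATA3
page goes; in the second they are 0.60 vs 0.50, so the old USB page
goes instead.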

[/Tangent]

But even if you favor SSD eviction over hard drive eviction (by any
metric), you'll eventually come to a point where you only have so much
page cache, and the ability to swap page cache to a huge SSD becomes
more attractive. It becomes an extremely large read cache. The Seagate
Momentus XT fares pretty well with an 8GB read cache for this (although
on their specialized, tightly integrated hardware, they've managed to
use write-back caching to gain a LOT of write performance, too). It
would make sense that using an SSD as a big page cache would supply
similar gains in specialized environments for less than the cost of
RAM.

Of course that just takes us back to the original question: Can we hint
to tmem where the data is coming from, and let it decide if it cares and
how to handle it?

[1] http://lwn.net/Articles/340409/