2013-04-14 23:57:40

by Wanpeng Li

Subject: [PATCH] staging: ramster: add how-to for ramster

From: Dan Magenheimer <[email protected]>

Add how-to for ramster.

Signed-off-by: Dan Magenheimer <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
drivers/staging/zcache/ramster/HOWTO.txt | 249 ++++++++++++++++++++++++++++++
1 file changed, 249 insertions(+)
create mode 100644 drivers/staging/zcache/ramster/HOWTO.txt

diff --git a/drivers/staging/zcache/ramster/HOWTO.txt b/drivers/staging/zcache/ramster/HOWTO.txt
new file mode 100644
index 0000000..e6387e8
--- /dev/null
+++ b/drivers/staging/zcache/ramster/HOWTO.txt
@@ -0,0 +1,249 @@
+Version: 130309
+ Dan Magenheimer <[email protected]>
+
+This is a how-to document for RAMster. It applies to the March 9, 2013
+version of RAMster, re-merged with the new zcache codebase, built and tested
+on the 3.9 tree and submitted for the staging tree for 3.9.
+
+Note that this document was created from notes taken earlier. I would
+appreciate any feedback from anyone who follows the process as described
+to confirm that it works and to clarify any possible misunderstandings,
+or to report problems.
+
+A. PRELIMINARY
+
+1) Install two or more Linux systems that are known to work when upgraded
+ to a recent upstream Linux kernel version (e.g. v3.9). I used Oracle
+ Linux 6 ("OL6") on two Dell Optiplex 790s. Note that it should be possible
+ to use ocfs2 as a filesystem on your systems but this hasn't been
+ tested thoroughly, so if you do use ocfs2 and run into problems, please
+ report them. Up to eight nodes should work, but not much testing has
+ been done with more than three nodes.
+
+On each system:
+
+2) Configure, build, install, and then boot Linux (e.g. 3.9), just to ensure it
+ can be done with an unmodified upstream kernel. Confirm you booted
+ the upstream kernel with "uname -a".
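+
+ For example (hypothetical output; the exact string depends on your
+ hostname and build):
+
+ # uname -a
+ Linux system1 3.9.0 #1 SMP ... x86_64 x86_64 x86_64 GNU/Linux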
+
+3) Install ramster-tools. The src.rpm and an OL6 rpm are available
+ in this directory. I'm not very good at userspace stuff and
+ would welcome any help in turning ramster-tools into more
+ distributable rpms/debs for a wider range of distros.
+
+B. BUILDING RAMSTER INTO THE KERNEL
+
+Do the following on each system:
+
+1) Ensure you have the new codebase for drivers/staging/zcache in your source.
+
+2) Change your .config to have:
+
+ CONFIG_CLEANCACHE=y
+ CONFIG_FRONTSWAP=y
+ CONFIG_STAGING=y
+ CONFIG_ZCACHE=y
+ CONFIG_RAMSTER=y
+
+ You may have to reconfigure your kernel multiple times to ensure
+ all of these are set properly. I use:
+
+ # yes "" | make oldconfig
+
+ and then manually check the .config file to ensure my selections
+ have "taken".
+
+ Do not bother to build the kernel until you are certain all of
+ the above config selections will stick for the build.
+
+3) Build this kernel and "make install" so that you have a new kernel
+ in /etc/grub.conf
+
+4) Add "ramster" to the kernel boot line in /etc/grub.conf.
+
+5) Reboot and check dmesg to ensure there are some messages from ramster
+ and that "ramster_enabled=1" appears.
+
+ # dmesg | grep ramster
+
+ You should also see a lot of files in:
+
+ # ls /sys/kernel/debug/zcache
+ # ls /sys/kernel/debug/ramster
+
+ and a few files in:
+
+ # ls /sys/kernel/mm/ramster
+
+ RAMster will now act as a single-system zcache, but it doesn't yet
+ know anything about the cluster, so it can't do anything remotely.
+
+C. BUILDING THE RAMSTER CLUSTER
+
+This is the error-prone part unless you are a clustering expert. We need
+to describe the cluster in the /etc/ramster.conf file, and the init scripts
+that parse it are extremely picky about the syntax.
+
+1) Create the /etc/ramster.conf file and ensure it is identical
+ on both systems. There is a good amount of similar documentation
+ for ocfs2 /etc/cluster.conf that can be googled for this, but I use:
+
+ cluster:
+ name = ramster
+ node_count = 2
+ node:
+ name = system1
+ cluster = ramster
+ number = 0
+ ip_address = my.ip.ad.r1
+ ip_port = 7777
+ node:
+ name = system2
+ cluster = ramster
+ number = 1
+ ip_address = my.ip.ad.r2
+ ip_port = 7777
+
+ You must ensure that the "name" field in the file exactly matches
+ the output of "hostname" on each system. The following assumes
+ you use "ramster" as the name of your cluster.
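+
+ A quick sanity check (a sketch; "system2" is the hypothetical second
+ node from the example above):
+
+ # hostname
+ system1
+ # scp /etc/ramster.conf system2:/etc/ramster.conf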
+
+2) Enable the ramster service and configure it:
+
+ # chkconfig --add ramster
+ # service ramster configure
+
+ Set "load on boot" to "y", cluster to start is "ramster" (or whatever
+ name you chose in ramster.conf), heartbeat dead threshold as "500",
+ network idle timeout as "1000000". Leave the others as default.
+
+3) Reboot both systems. After reboot, try:
+
+ # service ramster status
+
+ You should see "Checking ramster cluster ramster: Online". If you do
+ not, something is wrong and RAMster will not work. Note that you
+ should also see that the driver for "configfs" is loaded and mounted,
+ the driver for ocfs2_dlmfs is not loaded, and some numbers for network
+ parameters. You will also see "Checking ramster heartbeat: Not active".
+ That's all OK.
+
+4) Now you need to start the cluster heartbeat; the cluster is not "up"
+ until all nodes detect a heartbeat. Normally this is done via
+ a cluster filesystem, but you don't have one. Some hack-y
+ code in RAMster can start it for you, though, if you tell it which
+ nodes are "up". To enable it for nodes 0 and 1, do:
+
+ # echo 0 > /sys/kernel/mm/ramster/manual_node_up
+ # echo 1 > /sys/kernel/mm/ramster/manual_node_up
+
+ This must be done on ALL nodes. I usually put these lines
+ in /etc/rc.local as otherwise I forget. To confirm that
+ the cluster is now up, on both systems do:
+
+ # dmesg | grep ramster
+
+ You should see "Accepted connection" messages in dmesg after this.
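+
+ You can also confirm that each node is listening on the port given
+ in ramster.conf (a sketch; netstat is from net-tools):
+
+ # netstat -tln | grep 7777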
+
+5) You must tell each node the node to which it should "remotify" pages.
+ For example if you have a three-node cluster and you want nodes
+ 1 and 2 to be "clients" and node 0 to be the "memory server", then
+ on nodes 1 and 2, you do:
+
+ # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum
+
+ You should see "ramster: node N set as remotification target"
+ in dmesg. Again, /etc/rc.local is a good place to put this
+ so you don't forget to do it at each boot.
+
+6) One more step: By default, the RAMster code does not "remotify" any
+ pages; this default exists primarily for testing purposes, but it is
+ sometimes useful. This may change in the future, but for now, you must:
+
+ # echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
+ # echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable
+
+ The first enables remotifying swap (persistent, aka frontswap) pages,
+ the second enables remotifying of page cache (ephemeral, cleancache)
+ pages.
+
+ These lines can also be put in /etc/rc.local (AFTER the node_up
+ lines), or I often just put them at the beginning of my script that
+ runs a workload.
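+
+ Putting it all together, an /etc/rc.local fragment for a client node
+ in the two-node example above might look like this (a sketch; the
+ node numbers and the remotification target depend on your cluster):
+
+ # RAMster: mark nodes 0 and 1 up, remotify to node 0, enable remotify
+ echo 0 > /sys/kernel/mm/ramster/manual_node_up
+ echo 1 > /sys/kernel/mm/ramster/manual_node_up
+ echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum
+ echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
+ echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable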
+
+7) Most testing has been done with both/all machines booted roughly
+ simultaneously. Ideally, you should do this too unless you are
+ trying to break RAMster rather than just use it. ;-)
+
+D. TESTING RAMSTER
+
+1) Note that RAMster has no value unless pages get "remotified". For
+ swap/frontswap/persistent pages, this doesn't happen unless/until
+ the workload would cause swapping to occur, at which point pages
+ are put into frontswap/zcache, and the remotification thread starts
+ working. To get to the point where the system swaps, you either
+ need a workload for which the working set exceeds the RAM in the
+ system; or you need to somehow reduce the amount of RAM one of
+ the system sees. This latter is easy when testing in a VM, but
+ harder on physical systems. In some cases, "mem=xxxM" on the
+ kernel command line restricts memory, but for some values of xxx
+ my kernel fails to boot. I may also try creating a fixed RAMdisk,
+ doing nothing with it, but ensuring that it eats up a fixed
+ amount of RAM.
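+
+ A sketch of that last idea, using the brd ramdisk driver (sizes are
+ illustrative; rd_size is in kilobytes, and if brd is built into your
+ kernel, use the "ramdisk_size=" boot parameter instead):
+
+ # modprobe brd rd_nr=1 rd_size=1048576
+ # dd if=/dev/zero of=/dev/ram0 bs=1M
+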
+2) To see if RAMster is working, on the remote system, I do:
+
+ # watch -d 'cat /sys/kernel/debug/ramster/foreign_*'
+
+ to monitor the number (and max) ephemeral and persistent pages
+ that RAMster has sent. If these stay at 0, RAMster is not working
+ either because the workload isn't creating enough memory pressure
+ or because "remotifying" isn't working. On the system with the
+ workload, you can watch lots of useful information also, but beware
+ that you may be affecting the workload and performance. I use:
+
+ # watch ./watchme
+
+ where the watchme file contains (it must be executable, e.g. via
+ "chmod +x watchme"):
+
+ #!/bin/sh
+ for i in /sys/kernel/debug/zcache/evicted_buddied_pages \
+ /sys/kernel/debug/zcache/evicted_raw_pages \
+ /sys/kernel/debug/zcache/evicted_unbuddied_pages \
+ /sys/kernel/debug/zcache/zbud_curr_raw_pages \
+ /sys/kernel/debug/zcache/zbud_curr_zbytes \
+ /sys/kernel/debug/zcache/zbud_curr_zpages \
+ /sys/kernel/debug/ramster/eph_pages_remoted \
+ /sys/kernel/debug/ramster/remote_eph_pages_succ_get \
+ /sys/kernel/debug/ramster/remote_pers_pages_succ_get \
+ /sys/kernel/debug/frontswap/succ_puts
+ do
+ echo $i ": " $(cat $i)
+ done
+
+ And if you have debugfs mounted (as /sys/kernel/debug), you can
+ add to the watchme script some interesting counters in
+ /sys/kernel/debug/cleancache/* and /sys/kernel/debug/frontswap/*
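+
+ If debugfs is not already mounted, you can mount it with:
+
+ # mount -t debugfs none /sys/kernel/debug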
+
+3) In v4, there are known issues in counting certain values. As a result,
+ you may see periodic warnings from the kernel. Almost always you
+ will see "ramster: bad accounting for XXX". There are also "WARN_ONCE"
+ messages. These are harmless but reflect bugs that eventually need
+ to be fixed. If you see kernel warnings with a tombstone, please
+ report them.
+
+AUTOMATIC SWAP REPATRIATION
+
+You may notice that while the systems are idle, the foreign persistent
+page count on the remote machine slowly decreases. This is because
+RAMster implements "frontswap selfshrinking": When possible, swap
+pages that have been remotified are slowly repatriated to the local
+machine. This is so that local RAM can be used when possible and
+so that, in case of a remote machine crash, the probability of loss
+of data is reduced.
+
+REBOOTING / POWEROFF
+
+If a system is shut down while some of its swap pages still reside
+on a remote system, the system may lock up partway through the shutdown
+sequence. This is because the network is shut down before the
+swap mechanism is shut down. To avoid this annoying problem, simply
+shut off the swap subsystem before starting the shutdown sequence, e.g.:
+
+ # swapoff -a
+ # reboot
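+
+A tiny wrapper script (a sketch) makes this harder to forget:
+
+ #!/bin/sh
+ # ramster-reboot: swapoff first so remote swap pages are repatriated
+ swapoff -a
+ reboot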
+
--
1.7.10.4


2013-04-15 00:08:10

by Greg KH

Subject: Re: [PATCH] staging: ramster: add how-to for ramster

On Mon, Apr 15, 2013 at 07:56:56AM +0800, Wanpeng Li wrote:
> +This is a how-to document for RAMster. It applies to the March 9, 2013
> +version of RAMster, re-merged with the new zcache codebase, built and tested
> +on the 3.9 tree and submitted for the staging tree for 3.9.

This is not needed at all, given that it should just reflect the state
of the code in the kernel that this file is present in. Please remove
it.

> +Note that this document was created from notes taken earlier. I would
> +appreciate any feedback from anyone who follows the process as described
> +to confirm that it works and to clarify any possible misunderstandings,
> +or to report problems.

Is this needed?

> +A. PRELIMINARY
> +
> +1) Install two or more Linux systems that are known to work when upgraded
> + to a recent upstream Linux kernel version (e.g. v3.9). I used Oracle
> + Linux 6 ("OL6") on two Dell Optiplex 790s. Note that it should be possible
> + to use ocfs2 as a filesystem on your systems but this hasn't been
> + tested thoroughly, so if you do use ocfs2 and run into problems, please
> + report them. Up to eight nodes should work, but not much testing has
> + been done with more than three nodes.
> +
> +On each system:
> +
> +2) Configure, build, install, and then boot Linux (e.g. 3.9), just to ensure it
> + can be done with an unmodified upstream kernel. Confirm you booted
> + the upstream kernel with "uname -a".
> +
> +3) Install ramster-tools. The src.rpm and an OL6 rpm are available
> + in this directory. I'm not very good at userspace stuff and
> + would welcome any help in turning ramster-tools into more
> + distributable rpms/debs for a wider range of distros.

This isn't true, the rpms are not here.

> +B. BUILDING RAMSTER INTO THE KERNEL
> +
> +Do the following on each system:
> +
> +1) Ensure you have the new codebase for drivers/staging/zcache in your source.
> +
> +2) Change your .config to have:
> +
> + CONFIG_CLEANCACHE=y
> + CONFIG_FRONTSWAP=y
> + CONFIG_STAGING=y
> + CONFIG_ZCACHE=y
> + CONFIG_RAMSTER=y
> +
> + You may have to reconfigure your kernel multiple times to ensure
> + all of these are set properly. I use:
> +
> + # yes "" | make oldconfig
> +
> + and then manually check the .config file to ensure my selections
> + have "taken".

This last bit isn't needed at all. Just stick to the "these are the
settings you need enabled."

> + Do not bother to build the kernel until you are certain all of
> + the above config selections will stick for the build.
> +
> +3) Build this kernel and "make install" so that you have a new kernel
> + in /etc/grub.conf

Don't assume 'make install' works for all distros, nor that
/etc/grub.conf is a grub config file (hint, it usually isn't, and what
about all the people not even using grub for their bootloader?)

> +4) Add "ramster" to the kernel boot line in /etc/grub.conf.

Again, drop grub.conf reference

> +5) Reboot and check dmesg to ensure there are some messages from ramster
> + and that "ramster_enabled=1" appears.
> +
> + # dmesg | grep ramster

Are you sure ramster still spits out messages? If so, provide an
example of what it should look like.

> + You should also see a lot of files in:
> +
> + # ls /sys/kernel/debug/zcache
> + # ls /sys/kernel/debug/ramster

You forgot to mention that debugfs needs to be mounted.

> + and a few files in:
> +
> + # ls /sys/kernel/mm/ramster
> +
> + RAMster will now act as a single-system zcache, but it doesn't yet
> + know anything about the cluster, so it can't do anything remotely.
> +
> +C. BUILDING THE RAMSTER CLUSTER
> +
> +This is the error-prone part unless you are a clustering expert. We need
> +to describe the cluster in the /etc/ramster.conf file, and the init scripts
> +that parse it are extremely picky about the syntax.
> +
> +1) Create the /etc/ramster.conf file and ensure it is identical
> + on both systems. There is a good amount of similar documentation
> + for ocfs2 /etc/cluster.conf that can be googled for this, but I use:
> +
> + cluster:
> + name = ramster
> + node_count = 2
> + node:
> + name = system1
> + cluster = ramster
> + number = 0
> + ip_address = my.ip.ad.r1
> + ip_port = 7777
> + node:
> + name = system2
> + cluster = ramster
> + number = 1
> + ip_address = my.ip.ad.r2
> + ip_port = 7777
> +
> + You must ensure that the "name" field in the file exactly matches
> + the output of "hostname" on each system. The following assumes
> + you use "ramster" as the name of your cluster.
> +
> +2) Enable the ramster service and configure it:
> +
> + # chkconfig --add ramster
> + # service ramster configure

That's a huge assumption as to how your system config/startup scripts
work, right? Not all the world is using old-style system V init
anymore, what about systemd? openrc?

greg k-h