Howdy folks,
While working on a backport for the following changes:
3399446 swap: discard while swapping only if SWAP_FLAG_DISCARD
052b198 swap: don't do discard if no discard option added
We found ourselves around an interesting discussion on how limiting
the behavior with regard to user-visible swap areas configuration has become
after applying the aforementioned changesets.
Before commit 3399446, if the swap backing device supported DISCARD,
then a batched discard was issued at swapon(8) time, and fine-grained DISCARDs
were issued in between freeing swap page-clusters and re-writing to them.
As noticed at 3399446's commit message, the fine-grained discards often
didn't help on improving performance as expected, and were potentially causing
more trouble than desired. So, commit 3399446 introduced a new swapon flag,
to make the fine-grained discards while swapping conditional.
However a batched discard would have been issued everytime swapon(8) was
turning a new swap area available.
This batched operation that remained at sys_swapon was considered troublesome
for some setups, and specially because a sysadmin was not flagging swapon(8)
to do discards -- http://www.spinics.net/lists/linux-mm/msg31741.html
then, commit 052b198 got merged to address the scenario described above.
After this last commit, now we can either only do both batched and fine-grained
discards for swap, or none of them.
As depicted above, this seems to be not very flexible as it could be,
and the whole discussion we had (internally) left us wondering if does upstream
feel it would be useful to allow for both a batched discard as well as a
fine-grained discard option for swap? (By batched, here, it could mean just
the one time operation at swapon, or something similar to what fstrim does).
In fact, we all agreed with having no discards sent down at all as the default
behaviour. But thinking a little more about the use cases where device supports
discard:
a) and can do it quickly;
b) but it's slow to do in small granularities (or concurrent with other
I/O);
c) but the implementation is so horrendous that you don't even want to
send one down;
And assuming that the sysadmin considers it useful to send the discards down
at all, we would (probably) want the following solutions:
1) do the fine-grained discards if swap device is capable of doing so;
2) do batched discards, either at swapon or via something like fstrim; or
3) turn it off completely (default behavior nowadays)
i.e.: Today, if we have a swap device like (b), we cannot perform (2) even if
it might be regardeed as interesting, or necessary to the workload because it
would imply (1), and the device cannot do that and perform optimally.
With all that in mind, and in order to attempt to sort out the (un)flexibility
problem exposed above, I came up with the following patches:
01 (kernel) swap: discard while swapping only if SWAP_FLAG_DISCARD_CLUSTER
02 (util-linux) swapon: add "cluster-discard" support
Sorry for the long email.
Your feedback is very much appreciated!
-- Rafael
Intruduce a new flag to make page-cluster fine-grained discards while swapping
conditional, as they can be considered detrimental to some setups. However,
keep allowing batched discards at sys_swapon() time, when enabled by the
system administrator.
Signed-off-by: Rafael Aquini <[email protected]>
---
include/linux/swap.h | 8 +++++---
mm/swapfile.c | 12 ++++++++----
2 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1701ce4..ab2e742 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -19,10 +19,11 @@ struct bio;
#define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_PRIO_SHIFT 0
-#define SWAP_FLAG_DISCARD 0x10000 /* discard swap cluster after use */
+#define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap areas */
+#define SWAP_FLAG_DISCARD_CLUSTER 0x20000 /* discard swap clusters after use */
#define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
- SWAP_FLAG_DISCARD)
+ SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_CLUSTER)
static inline int current_is_kswapd(void)
{
@@ -152,8 +153,9 @@ enum {
SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */
SWP_BLKDEV = (1 << 6), /* its a block device */
SWP_FILE = (1 << 7), /* set after swap_activate success */
+ SWP_CLUSTERDISCARD = (1 << 8), /* discard swap cluster after usage */
/* add others here before... */
- SWP_SCANNING = (1 << 8), /* refcount in scan_swap_map */
+ SWP_SCANNING = (1 << 9), /* refcount in scan_swap_map */
};
#define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6c340d9..197461f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -212,7 +212,7 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
si->cluster_nr = SWAPFILE_CLUSTER - 1;
goto checks;
}
- if (si->flags & SWP_DISCARDABLE) {
+ if (si->flags & SWP_CLUSTERDISCARD) {
/*
* Start range check on racing allocations, in case
* they overlap the cluster we eventually decide on
@@ -322,7 +322,7 @@ checks:
if (si->lowest_alloc) {
/*
- * Only set when SWP_DISCARDABLE, and there's a scan
+ * Only set when SWP_CLUSTERDISCARD, and there's a scan
* for a free cluster in progress or just completed.
*/
if (found_free_cluster) {
@@ -2123,8 +2123,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
p->flags |= SWP_SOLIDSTATE;
p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
}
- if ((swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0)
+ if ((swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0) {
p->flags |= SWP_DISCARDABLE;
+ if (swap_flags & SWAP_FLAG_DISCARD_CLUSTER)
+ p->flags |= SWP_CLUSTERDISCARD;
+ }
}
mutex_lock(&swapon_mutex);
@@ -2135,11 +2138,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
enable_swap_info(p, prio, swap_map, frontswap_map);
printk(KERN_INFO "Adding %uk swap on %s. "
- "Priority:%d extents:%d across:%lluk %s%s%s\n",
+ "Priority:%d extents:%d across:%lluk %s%s%s%s\n",
p->pages<<(PAGE_SHIFT-10), name->name, p->prio,
nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
(p->flags & SWP_SOLIDSTATE) ? "SS" : "",
(p->flags & SWP_DISCARDABLE) ? "D" : "",
+ (p->flags & SWP_CLUSTERDISCARD) ? "C" : "",
(frontswap_map) ? "FS" : "");
mutex_unlock(&swapon_mutex);
--
1.7.11.7
Introduce a new swapon flag/option to support more flexible swap discard setup.
The --cluster-discard swapon(8) option can be used by a system admin to flag
sys_swapon() to perform page-cluster fine-grained discards.
This patch also changes the behaviour of swapon(8) --discard option, that now
will only be used to flag sys_swapon() batched discards will be issued at
swapon(8) time.
Signed-off-by: Rafael Aquini <[email protected]>
---
sys-utils/swapon.8 | 19 ++++++++++++++++---
sys-utils/swapon.c | 34 ++++++++++++++++++++++++++--------
2 files changed, 42 insertions(+), 11 deletions(-)
diff --git a/sys-utils/swapon.8 b/sys-utils/swapon.8
index 385bf5a..0c8ac69 100644
--- a/sys-utils/swapon.8
+++ b/sys-utils/swapon.8
@@ -112,10 +112,23 @@ All devices marked as ``swap'' in
are made available, except for those with the ``noauto'' option.
Devices that are already being used as swap are silently skipped.
.TP
+.TP
+.B "\-c, \-\-cluster\-discard"
+Swapping will discard clusters of swap pages in between freeing them
+and re-writing to them, if the swap device supports that. This option
+also implies the
+.I \-d, \-\-discard
+swapon flag.
+The
+.I /etc/fstab
+mount option
+.BI cluster\-discard
+may be also used to enable this flag.
+
+.TP
.B "\-d, \-\-discard"
-Discard freed swap pages before they are reused, if the swap
-device supports the discard or trim operation. This may improve
-performance on some Solid State Devices, but often it does not.
+Enables swap discards, if the swap device supports that, and performs
+a batch discard operation for the swap device at swapon time.
The
.I /etc/fstab
mount option
diff --git a/sys-utils/swapon.c b/sys-utils/swapon.c
index f1e2433..a71f69e 100644
--- a/sys-utils/swapon.c
+++ b/sys-utils/swapon.c
@@ -34,7 +34,11 @@
#endif
#ifndef SWAP_FLAG_DISCARD
-# define SWAP_FLAG_DISCARD 0x10000 /* discard swap cluster after use */
+# define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */
+#endif
+
+#ifndef SWAP_FLAG_DISCARD_CLUSTER
+# define SWAP_FLAG_DISCARD_CLUSTER 0x20000 /* discard swap cluster after use */
#endif
#ifndef SWAP_FLAG_PREFER
@@ -70,7 +74,7 @@ enum {
static int all;
static int priority = -1; /* non-prioritized swap by default */
-static int discard;
+static int discard = 0; /* don't send swap discards by default */
/* If true, don't complain if the device/file doesn't exist */
static int ifexists;
@@ -570,8 +574,11 @@ static int do_swapon(const char *orig_special, int prio,
<< SWAP_FLAG_PRIO_SHIFT);
}
#endif
- if (fl_discard)
+ if (fl_discard) {
flags |= SWAP_FLAG_DISCARD;
+ if (fl_discard > 1)
+ flags |= SWAP_FLAG_DISCARD_CLUSTER;
+ }
status = swapon(special, flags);
if (status < 0)
@@ -615,8 +622,14 @@ static int swapon_all(void)
if (mnt_fs_get_option(fs, "noauto", NULL, NULL) == 0)
continue;
- if (mnt_fs_get_option(fs, "discard", NULL, NULL) == 0)
- dsc = 1;
+ if (mnt_fs_get_option(fs, "discard", NULL, NULL) == 0) {
+ if !(dsc)
+ dsc = 1;
+ }
+ if (mnt_fs_get_option(fs, "cluster-discard", NULL, NULL) == 0) {
+ if (!dsc || dsc == 1)
+ dsc = 2;
+ }
if (mnt_fs_get_option(fs, "nofail", NULL, NULL) == 0)
nofail = 1;
if (mnt_fs_get_option(fs, "pri", &p, NULL) == 0 && p)
@@ -647,7 +660,8 @@ static void __attribute__ ((__noreturn__)) usage(FILE * out)
fputs(USAGE_OPTIONS, out);
fputs(_(" -a, --all enable all swaps from /etc/fstab\n"
- " -d, --discard discard freed pages before they are reused\n"
+ " -c, --cluster-discard discard freed pages before they are reused, while swapping\n"
+ " -d, --discard discard freed pages before they are reused, all at once, at swapon time\n"
" -e, --ifexists silently skip devices that do not exist\n"
" -f, --fixpgsz reinitialize the swap space if necessary\n"
" -p, --priority <prio> specify the priority of the swap device\n"
@@ -696,6 +710,7 @@ int main(int argc, char *argv[])
static const struct option long_opts[] = {
{ "priority", 1, 0, 'p' },
+ { "cluster-discard", 0, 0, 'c' },
{ "discard", 0, 0, 'd' },
{ "ifexists", 0, 0, 'e' },
{ "summary", 0, 0, 's' },
@@ -719,7 +734,7 @@ int main(int argc, char *argv[])
mnt_init_debug(0);
mntcache = mnt_new_cache();
- while ((c = getopt_long(argc, argv, "ahdefp:svVL:U:",
+ while ((c = getopt_long(argc, argv, "ahcdefp:svVL:U:",
long_opts, NULL)) != -1) {
switch (c) {
case 'a': /* all */
@@ -738,8 +753,11 @@ int main(int argc, char *argv[])
case 'U':
add_uuid(optarg);
break;
+ case 'c':
+ discard += 2;
+ break;
case 'd':
- discard = 1;
+ discard += 1;
break;
case 'e': /* ifexists */
ifexists = 1;
--
1.7.11.7
(5/20/13 8:04 PM), Rafael Aquini wrote:
> Intruduce a new flag to make page-cluster fine-grained discards while swapping
> conditional, as they can be considered detrimental to some setups. However,
> keep allowing batched discards at sys_swapon() time, when enabled by the
> system administrator.
>
> Signed-off-by: Rafael Aquini <[email protected]>
> ---
> include/linux/swap.h | 8 +++++---
> mm/swapfile.c | 12 ++++++++----
> 2 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 1701ce4..ab2e742 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -19,10 +19,11 @@ struct bio;
> #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
> #define SWAP_FLAG_PRIO_MASK 0x7fff
> #define SWAP_FLAG_PRIO_SHIFT 0
> -#define SWAP_FLAG_DISCARD 0x10000 /* discard swap cluster after use */
> +#define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap areas */
> +#define SWAP_FLAG_DISCARD_CLUSTER 0x20000 /* discard swap clusters after use */
>From point of backward compatibility view, 0x10000 should be disable both discarding
when mount and when IO.
And, introducing new two flags, enable mount time discard and enable IO time discard.
IOW, Please consider newer kernel and older swapon(8) conbination.
Other than that, looks good to me.
> +.B "\-c, \-\-cluster\-discard"
> +Swapping will discard clusters of swap pages in between freeing them
> +and re-writing to them, if the swap device supports that. This option
> +also implies the
> +.I \-d, \-\-discard
> +swapon flag.
I'm not sure this is good idea. Why can't we make these flags orthogonal?
> /* If true, don't complain if the device/file doesn't exist */
> static int ifexists;
> @@ -570,8 +574,11 @@ static int do_swapon(const char *orig_special, int prio,
> << SWAP_FLAG_PRIO_SHIFT);
> }
> #endif
> - if (fl_discard)
> + if (fl_discard) {
> flags |= SWAP_FLAG_DISCARD;
> + if (fl_discard > 1)
> + flags |= SWAP_FLAG_DISCARD_CLUSTER;
This is not enough, IMHO. When running this code on old kernel, swapon() return EINVAL.
At that time, we should fall back swapon(0x10000).
On Mon, May 20, 2013 at 09:04:25PM -0300, Rafael Aquini wrote:
> - while ((c = getopt_long(argc, argv, "ahdefp:svVL:U:",
> + while ((c = getopt_long(argc, argv, "ahcdefp:svVL:U:",
> long_opts, NULL)) != -1) {
> switch (c) {
> case 'a': /* all */
> @@ -738,8 +753,11 @@ int main(int argc, char *argv[])
> case 'U':
> add_uuid(optarg);
> break;
> + case 'c':
> + discard += 2;
> + break;
> case 'd':
> - discard = 1;
> + discard += 1;
this is fragile, it would be better to use
case 'c':
discard |= SWAP_FLAG_DISCARD_CLUSTER;
break;
case 'd':
discard |= SWAP_FLAG_DISCARD;
break;
and use directly the flags everywhere in the code than use magical
numbers '1' and '2' etc.
Karel
--
Karel Zak <[email protected]>
http://karelzak.blogspot.com
On Mon, May 20, 2013 at 09:02:43PM -0400, KOSAKI Motohiro wrote:
> > - if (fl_discard)
> > + if (fl_discard) {
> > flags |= SWAP_FLAG_DISCARD;
> > + if (fl_discard > 1)
> > + flags |= SWAP_FLAG_DISCARD_CLUSTER;
>
> This is not enough, IMHO. When running this code on old kernel, swapon() return EINVAL.
> At that time, we should fall back swapon(0x10000).
Hmm.. currently we don't use any fallback for any swap flag (e.g.
0x10000) for compatibility with old kernels. Maybe it's better to
keep it simple and stupid and return an error message than introduce
any super-smart semantic to hide incompatible fstab configuration.
Karel
--
Karel Zak <[email protected]>
http://karelzak.blogspot.com
(5/21/13 6:26 AM), Karel Zak wrote:
> On Mon, May 20, 2013 at 09:02:43PM -0400, KOSAKI Motohiro wrote:
>>> - if (fl_discard)
>>> + if (fl_discard) {
>>> flags |= SWAP_FLAG_DISCARD;
>>> + if (fl_discard > 1)
>>> + flags |= SWAP_FLAG_DISCARD_CLUSTER;
>>
>> This is not enough, IMHO. When running this code on old kernel, swapon() return EINVAL.
>> At that time, we should fall back swapon(0x10000).
>
> Hmm.. currently we don't use any fallback for any swap flag (e.g.
> 0x10000) for compatibility with old kernels. Maybe it's better to
> keep it simple and stupid and return an error message than introduce
> any super-smart semantic to hide incompatible fstab configuration.
Hm. If so, I'd propose to revert the following change.
> .B "\-d, \-\-discard"
>-Discard freed swap pages before they are reused, if the swap
>-device supports the discard or trim operation. This may improve
>-performance on some Solid State Devices, but often it does not.
>+Enables swap discards, if the swap device supports that, and performs
>+a batch discard operation for the swap device at swapon time.
And instead, I suggest to make --discard-on-swapon like the following.
(better name idea is welcome)
+--discard-on-swapon
+Enables swap discards, if the swap device supports that, and performs
+a batch discard operation for the swap device at swapon time.
I mean, preserving flags semantics removes the reason we need make a fallback.
Howdy Kosaki-san,
Thanks for your time over this one :)
On Mon, May 20, 2013 at 08:55:33PM -0400, KOSAKI Motohiro wrote:
> (5/20/13 8:04 PM), Rafael Aquini wrote:
> > Intruduce a new flag to make page-cluster fine-grained discards while swapping
> > conditional, as they can be considered detrimental to some setups. However,
> > keep allowing batched discards at sys_swapon() time, when enabled by the
> > system administrator.
> >
> > Signed-off-by: Rafael Aquini <[email protected]>
> > ---
> > include/linux/swap.h | 8 +++++---
> > mm/swapfile.c | 12 ++++++++----
> > 2 files changed, 13 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 1701ce4..ab2e742 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -19,10 +19,11 @@ struct bio;
> > #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
> > #define SWAP_FLAG_PRIO_MASK 0x7fff
> > #define SWAP_FLAG_PRIO_SHIFT 0
> > -#define SWAP_FLAG_DISCARD 0x10000 /* discard swap cluster after use */
> > +#define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap areas */
> > +#define SWAP_FLAG_DISCARD_CLUSTER 0x20000 /* discard swap clusters after use */
>
> From point of backward compatibility view, 0x10000 should be disable both discarding
> when mount and when IO.
I think you mean 0x10000 should be enable both here, then. That's a nice catch. I'll
try to think a way to accomplish it in a simple fashion.
> And, introducing new two flags, enable mount time discard and enable IO time discard.
>
> IOW, Please consider newer kernel and older swapon(8) conbination.
> Other than that, looks good to me.
>
>
Karel, Motohiro,
Thanks a lot for your time reviewing this patch and providing me with valuable
feedback.
On Tue, May 21, 2013 at 04:17:04PM -0400, KOSAKI Motohiro wrote:
> (5/21/13 6:26 AM), Karel Zak wrote:
> > On Mon, May 20, 2013 at 09:02:43PM -0400, KOSAKI Motohiro wrote:
> >>> - if (fl_discard)
> >>> + if (fl_discard) {
> >>> flags |= SWAP_FLAG_DISCARD;
> >>> + if (fl_discard > 1)
> >>> + flags |= SWAP_FLAG_DISCARD_CLUSTER;
> >>
> >> This is not enough, IMHO. When running this code on old kernel, swapon() return EINVAL.
> >> At that time, we should fall back swapon(0x10000).
> >
> > Hmm.. currently we don't use any fallback for any swap flag (e.g.
> > 0x10000) for compatibility with old kernels. Maybe it's better to
> > keep it simple and stupid and return an error message than introduce
> > any super-smart semantic to hide incompatible fstab configuration.
>
> Hm. If so, I'd propose to revert the following change.
>
> > .B "\-d, \-\-discard"
> >-Discard freed swap pages before they are reused, if the swap
> >-device supports the discard or trim operation. This may improve
> >-performance on some Solid State Devices, but often it does not.
> >+Enables swap discards, if the swap device supports that, and performs
> >+a batch discard operation for the swap device at swapon time.
>
>
> And instead, I suggest to make --discard-on-swapon like the following.
> (better name idea is welcome)
>
> +--discard-on-swapon
> +Enables swap discards, if the swap device supports that, and performs
> +a batch discard operation for the swap device at swapon time.
>
> I mean, preserving flags semantics removes the reason we need make a fallback.
>
>
Instead of reverting and renaming --discard, what about making it accept an
optional argument, so we could use --discard (to enable all thing and keep
backward compatibility); --discard=cluster & --discard=batch (or whatever we
think it should be named). I'll try to sort this approach out if you folks think
it's worthwhile.
-- Rafael
> Instead of reverting and renaming --discard, what about making it accept an
> optional argument, so we could use --discard (to enable all thing and keep
> backward compatibility); --discard=cluster & --discard=batch (or whatever we
> think it should be named). I'll try to sort this approach out if you folks think
> it's worthwhile.
Optional argument looks nice, at least to me.
But hmm..
"cluster" and "batch" describes current kernel implementation, not user visible effect.
Usually I suggest to pick up a word from man pages because it describe user visible action.
e.g. --discard=freed-pages or --discard=io or --discard=swapon or --discard=once, etc..
But this is not strong opinion. You can ignore it. I don't think I have good English sense. :-)
KOSAKI Motohiro <[email protected]> writes:
>> Instead of reverting and renaming --discard, what about making it accept an
>> optional argument, so we could use --discard (to enable all thing and keep
>> backward compatibility); --discard=cluster & --discard=batch (or whatever we
>> think it should be named). I'll try to sort this approach out if you folks think
>> it's worthwhile.
>
> Optional argument looks nice, at least to me.
>
> But hmm..
>
> "cluster" and "batch" describes current kernel implementation, not user visible effect.
> Usually I suggest to pick up a word from man pages because it describe user visible action.
> e.g. --discard=freed-pages or --discard=io or --discard=swapon or --discard=once, etc..
>
> But this is not strong opinion. You can ignore it. I don't think I have good English sense. :-)
I like discard=swapon. For the fine-grained discards, I don't have a
strong opinion, but I guess I'd lean towards freed-pages.
Cheers,
Jeff