I get this BUG in 5.1-rc1 sometimes when powering off the machine. I
suspect my setup erroneously executes two swapoff+cryptsetup close
operations simultaneously, so a race condition is triggered.
I am using a single swap on a plain dm-crypt device on a MBR partition
on a SATA drive.
I think the problem is probably related to
b56a2d8af9147a4efe4011b60d93779c0461ca97, so CCing the related people.
Thanks,
Alex.
On Sun, Mar 24, 2019 at 11:30 AM Alex Xu (Hello71) <[email protected]> wrote:
>
> I get this BUG in 5.1-rc1 sometimes when powering off the machine. I
> suspect my setup erroneously executes two swapoff+cryptsetup close
> operations simultaneously, so a race condition is triggered.
>
> I am using a single swap on a plain dm-crypt device on a MBR partition
> on a SATA drive.
>
> I think the problem is probably related to
> b56a2d8af9147a4efe4011b60d93779c0461ca97, so CCing the related people.
>
Could you please provide more information on this - stack trace, dmesg etc?
Is it easily reproducible? If yes, please detail the steps so that I
can try it inhouse.
Thanks,
Vineeth
Excerpts from Vineeth Pillai's message of March 25, 2019 6:08 pm:
> On Sun, Mar 24, 2019 at 11:30 AM Alex Xu (Hello71) <[email protected]> wrote:
>>
>> I get this BUG in 5.1-rc1 sometimes when powering off the machine. I
>> suspect my setup erroneously executes two swapoff+cryptsetup close
>> operations simultaneously, so a race condition is triggered.
>>
>> I am using a single swap on a plain dm-crypt device on a MBR partition
>> on a SATA drive.
>>
>> I think the problem is probably related to
>> b56a2d8af9147a4efe4011b60d93779c0461ca97, so CCing the related people.
>>
> Could you please provide more information on this - stack trace, dmesg etc?
> Is it easily reproducible? If yes, please detail the steps so that I
> can try it inhouse.
>
> Thanks,
> Vineeth
>
Some info from the BUG entry (I didn't bother to type it all,
low-quality image available upon request):
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
#PF error: [normal kernel read fault]
PGD 0 P4D 0
Oops: 0000 [#1] SMP
CPU: 0 Comm: swapoff Not tainted 5.1.0-rc1+ #2
RIP: 0010:shmem_recalc_inode+0x41/0x90
Call Trace:
? shmem_undo_range
? rb_erase_cached
? set_next_entity
? __inode_wait_for_writeback
? shmem_truncate_range
? shmem_evict_inode
? evict
? shmem_unuse
? try_to_unuse
? swapcache_free_entries
? _cond_resched
? __se_sys_swapoff
? do_syscall_64
? entry_SYSCALL_64_after_hwframe
As I said, it only occurs occasionally on shutdown. I think it is a safe
guess that it can only occur when the swap is not empty, but possibly
other conditions are necessary, so I will test further.
On Sun, 31 Mar 2019, Alex Xu (Hello71) wrote:
> Excerpts from Vineeth Pillai's message of March 25, 2019 6:08 pm:
> > On Sun, Mar 24, 2019 at 11:30 AM Alex Xu (Hello71) <[email protected]> wrote:
> >>
> >> I get this BUG in 5.1-rc1 sometimes when powering off the machine. I
> >> suspect my setup erroneously executes two swapoff+cryptsetup close
> >> operations simultaneously, so a race condition is triggered.
> >>
> >> I am using a single swap on a plain dm-crypt device on a MBR partition
> >> on a SATA drive.
> >>
> >> I think the problem is probably related to
> >> b56a2d8af9147a4efe4011b60d93779c0461ca97, so CCing the related people.
> >>
> > Could you please provide more information on this - stack trace, dmesg etc?
> > Is it easily reproducible? If yes, please detail the steps so that I
> > can try it inhouse.
> >
> > Thanks,
> > Vineeth
> >
>
> Some info from the BUG entry (I didn't bother to type it all,
> low-quality image available upon request):
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> #PF error: [normal kernel read fault]
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP
> CPU: 0 Comm: swapoff Not tainted 5.1.0-rc1+ #2
> RIP: 0010:shmem_recalc_inode+0x41/0x90
>
> Call Trace:
> ? shmem_undo_range
> ? rb_erase_cached
> ? set_next_entity
> ? __inode_wait_for_writeback
> ? shmem_truncate_range
> ? shmem_evict_inode
> ? evict
> ? shmem_unuse
> ? try_to_unuse
> ? swapcache_free_entries
> ? _cond_resched
> ? __se_sys_swapoff
> ? do_syscall_64
> ? entry_SYSCALL_64_after_hwframe
>
> As I said, it only occurs occasionally on shutdown. I think it is a safe
> guess that it can only occur when the swap is not empty, but possibly
> other conditions are necessary, so I will test further.
Thanks for the update, Alex. I'm looking into a couple of bugs with the
5.1-rc swapoff, but this one doesn't look like anything I know so far.
shmem_recalc_inode() is a surprising place to crash: it's as if the
igrab() in shmem_unuse() were not working.
Yes, please do send Vineeth and me (or the lists) your low-quality image,
in case we can extract any more info from it; and also please the
disassembly of your kernel's shmem_recalc_inode(), so we can be sure of
exactly what it's crashing on (though I expect that will leave me as
puzzled as before).
If you want to experiment with one of my fixes, not yet written up and
posted, just try changing SWAP_UNUSE_MAX_TRIES in mm/swapfile.c from
3 to INT_MAX: I don't see how that issue could manifest as crashing in
shmem_recalc_inode(), but I may just be too stupid to see it.
Thanks,
Hugh
On Sun, 31 Mar 2019, Hugh Dickins wrote:
> On Sun, 31 Mar 2019, Alex Xu (Hello71) wrote:
> > Excerpts from Vineeth Pillai's message of March 25, 2019 6:08 pm:
> > > On Sun, Mar 24, 2019 at 11:30 AM Alex Xu (Hello71) <[email protected]> wrote:
> > >>
> > >> I get this BUG in 5.1-rc1 sometimes when powering off the machine. I
> > >> suspect my setup erroneously executes two swapoff+cryptsetup close
> > >> operations simultaneously, so a race condition is triggered.
> > >>
> > >> I am using a single swap on a plain dm-crypt device on a MBR partition
> > >> on a SATA drive.
> > >>
> > >> I think the problem is probably related to
> > >> b56a2d8af9147a4efe4011b60d93779c0461ca97, so CCing the related people.
> > >>
> > > Could you please provide more information on this - stack trace, dmesg etc?
> > > Is it easily reproducible? If yes, please detail the steps so that I
> > > can try it inhouse.
> > >
> > > Thanks,
> > > Vineeth
> > >
> >
> > Some info from the BUG entry (I didn't bother to type it all,
> > low-quality image available upon request):
> >
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > #PF error: [normal kernel read fault]
> > PGD 0 P4D 0
> > Oops: 0000 [#1] SMP
> > CPU: 0 Comm: swapoff Not tainted 5.1.0-rc1+ #2
> > RIP: 0010:shmem_recalc_inode+0x41/0x90
> >
> > Call Trace:
> > ? shmem_undo_range
> > ? rb_erase_cached
> > ? set_next_entity
> > ? __inode_wait_for_writeback
> > ? shmem_truncate_range
> > ? shmem_evict_inode
> > ? evict
> > ? shmem_unuse
> > ? try_to_unuse
> > ? swapcache_free_entries
> > ? _cond_resched
> > ? __se_sys_swapoff
> > ? do_syscall_64
> > ? entry_SYSCALL_64_after_hwframe
> >
> > As I said, it only occurs occasionally on shutdown. I think it is a safe
> > guess that it can only occur when the swap is not empty, but possibly
> > other conditions are necessary, so I will test further.
>
> Thanks for the update, Alex. I'm looking into a couple of bugs with the
> 5.1-rc swapoff, but this one doesn't look like anything I know so far.
> shmem_recalc_inode() is a surprising place to crash: it's as if the
> igrab() in shmem_unuse() were not working.
>
> Yes, please do send Vineeth and me (or the lists) your low-quality image,
> in case we can extract any more info from it; and also please the
> disassembly of your kernel's shmem_recalc_inode(), so we can be sure of
> exactly what it's crashing on (though I expect that will leave me as
> puzzled as before).
>
> If you want to experiment with one of my fixes, not yet written up and
> posted, just try changing SWAP_UNUSE_MAX_TRIES in mm/swapfile.c from
> 3 to INT_MAX: I don't see how that issue could manifest as crashing in
> shmem_recalc_inode(), but I may just be too stupid to see it.
Thanks for the image and disassembly you sent: which showed that the
ffffffff81117351: 48 83 3f 00 cmpq $0x0,(%rdi)
you are crashing on, is the "if (sbinfo->max_blocks)" in the inlined
shmem_inode_unacct_blocks(): inode->i_sb->s_fs_info is NULL, which is
something that shmem_put_super() does.
Eight-year-old memories stirred: I knew when looking at Vineeth's patch,
that I ought to look back through the history of mm/shmem.c, to check
some points that Konstantin Khlebnikov had made years ago, that
surprised me then and were in danger of surprising us again with this
rework. But I failed to do so: thank you Alex, for reporting this bug
and pointing us back there.
igrab() protects from eviction but does not protect from unmounting.
I bet that is what you are hitting, though I've not even read through
2.6.39's 778dd893ae785 ("tmpfs: fix race between umount and swapoff")
again yet, and not begun to think of the fix for it this time around;
but wanted to let you know that this bug is now (probably) identified.
Hugh
On Tue, 2 Apr 2019, Hugh Dickins wrote:
> On Sun, 31 Mar 2019, Hugh Dickins wrote:
> > On Sun, 31 Mar 2019, Alex Xu (Hello71) wrote:
> > > Excerpts from Vineeth Pillai's message of March 25, 2019 6:08 pm:
> > > > On Sun, Mar 24, 2019 at 11:30 AM Alex Xu (Hello71) <[email protected]> wrote:
> > > >>
> > > >> I get this BUG in 5.1-rc1 sometimes when powering off the machine. I
> > > >> suspect my setup erroneously executes two swapoff+cryptsetup close
> > > >> operations simultaneously, so a race condition is triggered.
> > > >>
> > > >> I am using a single swap on a plain dm-crypt device on a MBR partition
> > > >> on a SATA drive.
> > > >>
> > > >> I think the problem is probably related to
> > > >> b56a2d8af9147a4efe4011b60d93779c0461ca97, so CCing the related people.
> > > >>
> > > > Could you please provide more information on this - stack trace, dmesg etc?
> > > > Is it easily reproducible? If yes, please detail the steps so that I
> > > > can try it inhouse.
> > > >
> > > > Thanks,
> > > > Vineeth
> > > >
> > >
> > > Some info from the BUG entry (I didn't bother to type it all,
> > > low-quality image available upon request):
> > >
> > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > > #PF error: [normal kernel read fault]
> > > PGD 0 P4D 0
> > > Oops: 0000 [#1] SMP
> > > CPU: 0 Comm: swapoff Not tainted 5.1.0-rc1+ #2
> > > RIP: 0010:shmem_recalc_inode+0x41/0x90
> > >
> > > Call Trace:
> > > ? shmem_undo_range
> > > ? rb_erase_cached
> > > ? set_next_entity
> > > ? __inode_wait_for_writeback
> > > ? shmem_truncate_range
> > > ? shmem_evict_inode
> > > ? evict
> > > ? shmem_unuse
> > > ? try_to_unuse
> > > ? swapcache_free_entries
> > > ? _cond_resched
> > > ? __se_sys_swapoff
> > > ? do_syscall_64
> > > ? entry_SYSCALL_64_after_hwframe
> > >
> > > As I said, it only occurs occasionally on shutdown. I think it is a safe
> > > guess that it can only occur when the swap is not empty, but possibly
> > > other conditions are necessary, so I will test further.
> >
> > Thanks for the update, Alex. I'm looking into a couple of bugs with the
> > 5.1-rc swapoff, but this one doesn't look like anything I know so far.
> > shmem_recalc_inode() is a surprising place to crash: it's as if the
> > igrab() in shmem_unuse() were not working.
> >
> > Yes, please do send Vineeth and me (or the lists) your low-quality image,
> > in case we can extract any more info from it; and also please the
> > disassembly of your kernel's shmem_recalc_inode(), so we can be sure of
> > exactly what it's crashing on (though I expect that will leave me as
> > puzzled as before).
> >
> > If you want to experiment with one of my fixes, not yet written up and
> > posted, just try changing SWAP_UNUSE_MAX_TRIES in mm/swapfile.c from
> > 3 to INT_MAX: I don't see how that issue could manifest as crashing in
> > shmem_recalc_inode(), but I may just be too stupid to see it.
>
> Thanks for the image and disassembly you sent: which showed that the
> ffffffff81117351: 48 83 3f 00 cmpq $0x0,(%rdi)
> you are crashing on, is the "if (sbinfo->max_blocks)" in the inlined
> shmem_inode_unacct_blocks(): inode->i_sb->s_fs_info is NULL, which is
> something that shmem_put_super() does.
>
> Eight-year-old memories stirred: I knew when looking at Vineeth's patch,
> that I ought to look back through the history of mm/shmem.c, to check
> some points that Konstantin Khlebnikov had made years ago, that
> surprised me then and were in danger of surprising us again with this
> rework. But I failed to do so: thank you Alex, for reporting this bug
> and pointing us back there.
>
> igrab() protects from eviction but does not protect from unmounting.
> I bet that is what you are hitting, though I've not even read through
> 2.6.39's 778dd893ae785 ("tmpfs: fix race between umount and swapoff")
> again yet, and not begun to think of the fix for it this time around;
> but wanted to let you know that this bug is now (probably) identified.
Hi Alex, could you please give the patch below a try? It fixes a
problem, but I'm not sure that it's your problem - please let us know.
I've not yet written up the commit description, and this should end up
as 4/4 in a series fixing several new swapoff issues: I'll wait to post
the finished series until heard back from you.
I did first try following the suggestion Konstantin had made back then,
for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active).
But it turned out to be difficult to get right in shmem_unuse(), because
of the way that relies on the inode as a cursor in the list - problem
when you've acquired an s_active reference, but fail to acquire inode
reference, and cannot safely release the s_active reference while still
holding the swaplist mutex.
If VFS offered an isgrab(inode), like igrab() but acquiring s_active
reference while holding i_lock, that would drop very easily into the
current shmem_unuse() as a replacement there for igrab(). But the rest
of the world has managed without that for years, so I'm disinclined to
add it just for this. And the patch below seems good enough without it.
Thanks,
Hugh
---
include/linux/shmem_fs.h | 1 +
mm/shmem.c | 39 ++++++++++++++++++---------------------
2 files changed, 19 insertions(+), 21 deletions(-)
--- 5.1-rc3/include/linux/shmem_fs.h 2019-03-17 16:18:15.181820820 -0700
+++ linux/include/linux/shmem_fs.h 2019-04-04 16:18:08.193512968 -0700
@@ -21,6 +21,7 @@ struct shmem_inode_info {
struct list_head swaplist; /* chain of maybes on swap */
struct shared_policy policy; /* NUMA memory alloc policy */
struct simple_xattrs xattrs; /* list of xattrs */
+ atomic_t stop_eviction; /* hold when working on inode */
struct inode vfs_inode;
};
--- 5.1-rc3/mm/shmem.c 2019-03-17 16:18:15.701823872 -0700
+++ linux/mm/shmem.c 2019-04-04 16:18:08.193512968 -0700
@@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
}
spin_unlock(&sbinfo->shrinklist_lock);
}
- if (!list_empty(&info->swaplist)) {
+ while (!list_empty(&info->swaplist)) {
+ /* Wait while shmem_unuse() is scanning this inode... */
+ wait_var_event(&info->stop_eviction,
+ !atomic_read(&info->stop_eviction));
mutex_lock(&shmem_swaplist_mutex);
list_del_init(&info->swaplist);
+ /* ...but beware of the race if we peeked too early */
+ if (!atomic_read(&info->stop_eviction))
+ list_del_init(&info->swaplist);
mutex_unlock(&shmem_swaplist_mutex);
}
}
@@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool
unsigned long *fs_pages_to_unuse)
{
struct shmem_inode_info *info, *next;
- struct inode *inode;
- struct inode *prev_inode = NULL;
int error = 0;
if (list_empty(&shmem_swaplist))
return 0;
mutex_lock(&shmem_swaplist_mutex);
-
- /*
- * The extra refcount on the inode is necessary to safely dereference
- * p->next after re-acquiring the lock. New shmem inodes with swap
- * get added to the end of the list and we will scan them all.
- */
list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
if (!info->swapped) {
list_del_init(&info->swaplist);
continue;
}
-
- inode = igrab(&info->vfs_inode);
- if (!inode)
- continue;
-
+ /*
+ * Drop the swaplist mutex while searching the inode for swap;
+ * but before doing so, make sure shmem_evict_inode() will not
+ * remove placeholder inode from swaplist, nor let it be freed
+ * (igrab() would protect from unlink, but not from unmount).
+ */
+ atomic_inc(&info->stop_eviction);
mutex_unlock(&shmem_swaplist_mutex);
- if (prev_inode)
- iput(prev_inode);
- prev_inode = inode;
- error = shmem_unuse_inode(inode, type, frontswap,
+ error = shmem_unuse_inode(&info->vfs_inode, type, frontswap,
fs_pages_to_unuse);
cond_resched();
@@ -1264,14 +1261,13 @@ int shmem_unuse(unsigned int type, bool
next = list_next_entry(info, swaplist);
if (!info->swapped)
list_del_init(&info->swaplist);
+ if (atomic_dec_and_test(&info->stop_eviction))
+ wake_up_var(&info->stop_eviction);
if (error)
break;
}
mutex_unlock(&shmem_swaplist_mutex);
- if (prev_inode)
- iput(prev_inode);
-
return error;
}
@@ -2238,6 +2234,7 @@ static struct inode *shmem_get_inode(str
info = SHMEM_I(inode);
memset(info, 0, (char *)inode - (char *)info);
spin_lock_init(&info->lock);
+ atomic_set(&info->stop_eviction, 0);
info->seals = F_SEAL_SEAL;
info->flags = flags & VM_NORESERVE;
INIT_LIST_HEAD(&info->shrinklist);
On 05.04.2019 5:12, Hugh Dickins wrote:
> On Tue, 2 Apr 2019, Hugh Dickins wrote:
>> On Sun, 31 Mar 2019, Hugh Dickins wrote:
>>> On Sun, 31 Mar 2019, Alex Xu (Hello71) wrote:
>>>> Excerpts from Vineeth Pillai's message of March 25, 2019 6:08 pm:
>>>>> On Sun, Mar 24, 2019 at 11:30 AM Alex Xu (Hello71) <[email protected]> wrote:
>>>>>>
>>>>>> I get this BUG in 5.1-rc1 sometimes when powering off the machine. I
>>>>>> suspect my setup erroneously executes two swapoff+cryptsetup close
>>>>>> operations simultaneously, so a race condition is triggered.
>>>>>>
>>>>>> I am using a single swap on a plain dm-crypt device on a MBR partition
>>>>>> on a SATA drive.
>>>>>>
>>>>>> I think the problem is probably related to
>>>>>> b56a2d8af9147a4efe4011b60d93779c0461ca97, so CCing the related people.
>>>>>>
>>>>> Could you please provide more information on this - stack trace, dmesg etc?
>>>>> Is it easily reproducible? If yes, please detail the steps so that I
>>>>> can try it inhouse.
>>>>>
>>>>> Thanks,
>>>>> Vineeth
>>>>>
>>>>
>>>> Some info from the BUG entry (I didn't bother to type it all,
>>>> low-quality image available upon request):
>>>>
>>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>>>> #PF error: [normal kernel read fault]
>>>> PGD 0 P4D 0
>>>> Oops: 0000 [#1] SMP
>>>> CPU: 0 Comm: swapoff Not tainted 5.1.0-rc1+ #2
>>>> RIP: 0010:shmem_recalc_inode+0x41/0x90
>>>>
>>>> Call Trace:
>>>> ? shmem_undo_range
>>>> ? rb_erase_cached
>>>> ? set_next_entity
>>>> ? __inode_wait_for_writeback
>>>> ? shmem_truncate_range
>>>> ? shmem_evict_inode
>>>> ? evict
>>>> ? shmem_unuse
>>>> ? try_to_unuse
>>>> ? swapcache_free_entries
>>>> ? _cond_resched
>>>> ? __se_sys_swapoff
>>>> ? do_syscall_64
>>>> ? entry_SYSCALL_64_after_hwframe
>>>>
>>>> As I said, it only occurs occasionally on shutdown. I think it is a safe
>>>> guess that it can only occur when the swap is not empty, but possibly
>>>> other conditions are necessary, so I will test further.
>>>
>>> Thanks for the update, Alex. I'm looking into a couple of bugs with the
>>> 5.1-rc swapoff, but this one doesn't look like anything I know so far.
>>> shmem_recalc_inode() is a surprising place to crash: it's as if the
>>> igrab() in shmem_unuse() were not working.
>>>
>>> Yes, please do send Vineeth and me (or the lists) your low-quality image,
>>> in case we can extract any more info from it; and also please the
>>> disassembly of your kernel's shmem_recalc_inode(), so we can be sure of
>>> exactly what it's crashing on (though I expect that will leave me as
>>> puzzled as before).
>>>
>>> If you want to experiment with one of my fixes, not yet written up and
>>> posted, just try changing SWAP_UNUSE_MAX_TRIES in mm/swapfile.c from
>>> 3 to INT_MAX: I don't see how that issue could manifest as crashing in
>>> shmem_recalc_inode(), but I may just be too stupid to see it.
>>
>> Thanks for the image and disassembly you sent: which showed that the
>> ffffffff81117351: 48 83 3f 00 cmpq $0x0,(%rdi)
>> you are crashing on, is the "if (sbinfo->max_blocks)" in the inlined
>> shmem_inode_unacct_blocks(): inode->i_sb->s_fs_info is NULL, which is
>> something that shmem_put_super() does.
>>
>> Eight-year-old memories stirred: I knew when looking at Vineeth's patch,
>> that I ought to look back through the history of mm/shmem.c, to check
>> some points that Konstantin Khlebnikov had made years ago, that
>> surprised me then and were in danger of surprising us again with this
>> rework. But I failed to do so: thank you Alex, for reporting this bug
>> and pointing us back there.
>>
>> igrab() protects from eviction but does not protect from unmounting.
>> I bet that is what you are hitting, though I've not even read through
>> 2.6.39's 778dd893ae785 ("tmpfs: fix race between umount and swapoff")
>> again yet, and not begun to think of the fix for it this time around;
>> but wanted to let you know that this bug is now (probably) identified.
>
> Hi Alex, could you please give the patch below a try? It fixes a
> problem, but I'm not sure that it's your problem - please let us know.
>
> I've not yet written up the commit description, and this should end up
> as 4/4 in a series fixing several new swapoff issues: I'll wait to post
> the finished series until heard back from you.
>
> I did first try following the suggestion Konstantin had made back then,
> for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active).
>
> But it turned out to be difficult to get right in shmem_unuse(), because
> of the way that relies on the inode as a cursor in the list - problem
> when you've acquired an s_active reference, but fail to acquire inode
> reference, and cannot safely release the s_active reference while still
> holding the swaplist mutex.
>
> If VFS offered an isgrab(inode), like igrab() but acquiring s_active
> reference while holding i_lock, that would drop very easily into the
> current shmem_unuse() as a replacement there for igrab(). But the rest
> of the world has managed without that for years, so I'm disinclined to
> add it just for this. And the patch below seems good enough without it.
>
> Thanks,
> Hugh
>
> ---
>
> include/linux/shmem_fs.h | 1 +
> mm/shmem.c | 39 ++++++++++++++++++---------------------
> 2 files changed, 19 insertions(+), 21 deletions(-)
>
> --- 5.1-rc3/include/linux/shmem_fs.h 2019-03-17 16:18:15.181820820 -0700
> +++ linux/include/linux/shmem_fs.h 2019-04-04 16:18:08.193512968 -0700
> @@ -21,6 +21,7 @@ struct shmem_inode_info {
> struct list_head swaplist; /* chain of maybes on swap */
> struct shared_policy policy; /* NUMA memory alloc policy */
> struct simple_xattrs xattrs; /* list of xattrs */
> + atomic_t stop_eviction; /* hold when working on inode */
> struct inode vfs_inode;
> };
>
> --- 5.1-rc3/mm/shmem.c 2019-03-17 16:18:15.701823872 -0700
> +++ linux/mm/shmem.c 2019-04-04 16:18:08.193512968 -0700
> @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
> }
> spin_unlock(&sbinfo->shrinklist_lock);
> }
> - if (!list_empty(&info->swaplist)) {
> + while (!list_empty(&info->swaplist)) {
> + /* Wait while shmem_unuse() is scanning this inode... */
> + wait_var_event(&info->stop_eviction,
> + !atomic_read(&info->stop_eviction));
> mutex_lock(&shmem_swaplist_mutex);
> list_del_init(&info->swaplist);
> + /* ...but beware of the race if we peeked too early */
> + if (!atomic_read(&info->stop_eviction))
> + list_del_init(&info->swaplist);
> mutex_unlock(&shmem_swaplist_mutex);
> }
> }
> @@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool
> unsigned long *fs_pages_to_unuse)
> {
> struct shmem_inode_info *info, *next;
> - struct inode *inode;
> - struct inode *prev_inode = NULL;
> int error = 0;
>
> if (list_empty(&shmem_swaplist))
> return 0;
>
> mutex_lock(&shmem_swaplist_mutex);
> -
> - /*
> - * The extra refcount on the inode is necessary to safely dereference
> - * p->next after re-acquiring the lock. New shmem inodes with swap
> - * get added to the end of the list and we will scan them all.
> - */
> list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
> if (!info->swapped) {
> list_del_init(&info->swaplist);
> continue;
> }
> -
> - inode = igrab(&info->vfs_inode);
> - if (!inode)
> - continue;
> -
> + /*
> + * Drop the swaplist mutex while searching the inode for swap;
> + * but before doing so, make sure shmem_evict_inode() will not
> + * remove placeholder inode from swaplist, nor let it be freed
> + * (igrab() would protect from unlink, but not from unmount).
> + */
> + atomic_inc(&info->stop_eviction);
> mutex_unlock(&shmem_swaplist_mutex);
> - if (prev_inode)
> - iput(prev_inode);
> - prev_inode = inode;
This seems too ad hoc solution.
Superblock could be protected with s_umount,
in same way as writeback pins it in __writeback_inodes_wb()
Please see (completely untested) patch in attachment.
>
> - error = shmem_unuse_inode(inode, type, frontswap,
> + error = shmem_unuse_inode(&info->vfs_inode, type, frontswap,
> fs_pages_to_unuse);
> cond_resched();
>
> @@ -1264,14 +1261,13 @@ int shmem_unuse(unsigned int type, bool
> next = list_next_entry(info, swaplist);
> if (!info->swapped)
> list_del_init(&info->swaplist);
> + if (atomic_dec_and_test(&info->stop_eviction))
> + wake_up_var(&info->stop_eviction);
> if (error)
> break;
> }
> mutex_unlock(&shmem_swaplist_mutex);
>
> - if (prev_inode)
> - iput(prev_inode);
> -
> return error;
> }
>
> @@ -2238,6 +2234,7 @@ static struct inode *shmem_get_inode(str
> info = SHMEM_I(inode);
> memset(info, 0, (char *)inode - (char *)info);
> spin_lock_init(&info->lock);
> + atomic_set(&info->stop_eviction, 0);
> info->seals = F_SEAL_SEAL;
> info->flags = flags & VM_NORESERVE;
> INIT_LIST_HEAD(&info->shrinklist);
>
On Fri, 5 Apr 2019, Konstantin Khlebnikov wrote:
> On 05.04.2019 5:12, Hugh Dickins wrote:
> > Hi Alex, could you please give the patch below a try? It fixes a
> > problem, but I'm not sure that it's your problem - please let us know.
> >
> > I've not yet written up the commit description, and this should end up
> > as 4/4 in a series fixing several new swapoff issues: I'll wait to post
> > the finished series until heard back from you.
> >
> > I did first try following the suggestion Konstantin had made back then,
> > for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active).
> >
> > But it turned out to be difficult to get right in shmem_unuse(), because
> > of the way that relies on the inode as a cursor in the list - problem
> > when you've acquired an s_active reference, but fail to acquire inode
> > reference, and cannot safely release the s_active reference while still
> > holding the swaplist mutex.
> >
> > If VFS offered an isgrab(inode), like igrab() but acquiring s_active
> > reference while holding i_lock, that would drop very easily into the
> > current shmem_unuse() as a replacement there for igrab(). But the rest
> > of the world has managed without that for years, so I'm disinclined to
> > add it just for this. And the patch below seems good enough without it.
> >
> > Thanks,
> > Hugh
> >
> > ---
> >
> > include/linux/shmem_fs.h | 1 +
> > mm/shmem.c | 39 ++++++++++++++++++---------------------
> > 2 files changed, 19 insertions(+), 21 deletions(-)
> >
> > --- 5.1-rc3/include/linux/shmem_fs.h 2019-03-17 16:18:15.181820820 -0700
> > +++ linux/include/linux/shmem_fs.h 2019-04-04 16:18:08.193512968 -0700
> > @@ -21,6 +21,7 @@ struct shmem_inode_info {
> > struct list_head swaplist; /* chain of maybes on swap */
> > struct shared_policy policy; /* NUMA memory alloc policy
> > */
> > struct simple_xattrs xattrs; /* list of xattrs */
> > + atomic_t stop_eviction; /* hold when working on inode
> > */
> > struct inode vfs_inode;
> > };
> > --- 5.1-rc3/mm/shmem.c 2019-03-17 16:18:15.701823872 -0700
> > +++ linux/mm/shmem.c 2019-04-04 16:18:08.193512968 -0700
> > @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
> > }
> > spin_unlock(&sbinfo->shrinklist_lock);
> > }
> > - if (!list_empty(&info->swaplist)) {
> > + while (!list_empty(&info->swaplist)) {
> > + /* Wait while shmem_unuse() is scanning this inode...
> > */
> > + wait_var_event(&info->stop_eviction,
> > + !atomic_read(&info->stop_eviction));
> > mutex_lock(&shmem_swaplist_mutex);
> > list_del_init(&info->swaplist);
> > + /* ...but beware of the race if we peeked too early
> > */
> > + if (!atomic_read(&info->stop_eviction))
> > + list_del_init(&info->swaplist);
> > mutex_unlock(&shmem_swaplist_mutex);
> > }
> > }
> > @@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool
> > unsigned long *fs_pages_to_unuse)
> > {
> > struct shmem_inode_info *info, *next;
> > - struct inode *inode;
> > - struct inode *prev_inode = NULL;
> > int error = 0;
> > if (list_empty(&shmem_swaplist))
> > return 0;
> > mutex_lock(&shmem_swaplist_mutex);
> > -
> > - /*
> > - * The extra refcount on the inode is necessary to safely dereference
> > - * p->next after re-acquiring the lock. New shmem inodes with swap
> > - * get added to the end of the list and we will scan them all.
> > - */
> > list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
> > if (!info->swapped) {
> > list_del_init(&info->swaplist);
> > continue;
> > }
> > -
> > - inode = igrab(&info->vfs_inode);
> > - if (!inode)
> > - continue;
> > -
> > + /*
> > + * Drop the swaplist mutex while searching the inode for
> > swap;
> > + * but before doing so, make sure shmem_evict_inode() will
> > not
> > + * remove placeholder inode from swaplist, nor let it be
> > freed
> > + * (igrab() would protect from unlink, but not from unmount).
> > + */
> > + atomic_inc(&info->stop_eviction);
> > mutex_unlock(&shmem_swaplist_mutex);
> > - if (prev_inode)
> > - iput(prev_inode);
> > - prev_inode = inode;
> This seems too ad hoc solution.
I see what you mean by "ad hoc", but disagree with "too" ad hoc:
it's an appropriate solution, and a general one - I didn't invent it
for this, but for the huge tmpfs recoveries work items four years ago;
just changed the name from "info->recoveries" to "info->stop_eviction"
to let it be generalized to this swapoff case.
I prefer mine, since it simplifies shmem_unuse() (no igrab!), and has
the nice (but admittedly not essential) property of letting swapoff
proceed without delay and without unnecessary locking on unmounting
filesystems and evicting inodes. (Would I prefer to use the s_umount
technique for my recoveries case? I think not.)
But yours should work too, with a slight change - see comments below,
where I've inlined yours. I'd better get on and post my four fixes
tomorrow, whether or not they fix Alex's case; then if people prefer
yours to my 4/4, yours can be swapped in instead.
> shmem: fix race between shmem_unuse and umount
>
> From: Konstantin Khlebnikov <[email protected]>
>
> Function shmem_unuse could race with generic_shutdown_super.
> Inode reference is not enough for preventing umount and freeing superblock.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> ---
> mm/shmem.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b3db3779a30a..2018a9a96bb7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1218,6 +1218,10 @@ static int shmem_unuse_inode(struct inode *inode, unsigned int type,
> return ret;
> }
>
> +static void shmem_synchronize_umount(struct super_block *sb, void *arg)
> +{
> +}
> +
I think this can go away, see below.
> /*
> * Read all the shared memory data that resides in the swap
> * device 'type' back into memory, so the swap device can be
> @@ -1229,6 +1233,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
> struct shmem_inode_info *info, *next;
> struct inode *inode;
> struct inode *prev_inode = NULL;
> + struct super_block *sb;
> int error = 0;
>
> if (list_empty(&shmem_swaplist))
> @@ -1247,9 +1252,22 @@ int shmem_unuse(unsigned int type, bool frontswap,
> continue;
> }
>
> + /*
> + * Lock superblock to prevent umount and freeing it under us.
> + * If umount in progress it will free swap enties.
> + *
> + * Must be done before grabbing inode reference, otherwise
> + * generic_shutdown_super() will complain about busy inodes.
> + */
> + sb = info->vfs_inode.i_sb;
> + if (!trylock_super(sb))
Right, trylock important there.
> + continue;
> +
> inode = igrab(&info->vfs_inode);
> - if (!inode)
> + if (!inode) {
> + up_read(&sb->s_umount);
Yes, that indeed avoids the difficulty I had with when to call
deactivate_super(), that put me off trying to use s_active.
> continue;
> + }
>
> mutex_unlock(&shmem_swaplist_mutex);
> if (prev_inode)
> @@ -1258,6 +1276,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
>
> error = shmem_unuse_inode(inode, type, frontswap,
> fs_pages_to_unuse);
> + up_read(&sb->s_umount);
No, not here. I think you have to note prev_sb, and then only
up_read(&prev_sb->s_umount) after each iput(prev_inode): otherwise
there's still a risk of "Self-destruct in 5 seconds", isn't there?
> cond_resched();
>
> mutex_lock(&shmem_swaplist_mutex);
> @@ -1272,6 +1291,9 @@ int shmem_unuse(unsigned int type, bool frontswap,
> if (prev_inode)
> iput(prev_inode);
>
> + /* Wait for umounts, this grabs s_umount for each superblock. */
> + iterate_supers_type(&shmem_fs_type, shmem_synchronize_umount, NULL);
> +
I guess that's an attempt to compensate for the somewhat unsatisfactory
trylock above (bearing in mind the SWAP_UNUSE_MAX_TRIES 3, but I remove
that in my 2/4). Nice idea, and if it had the effect of never needing to
retry shmem_unuse(), I'd say yes; but since you're still passing over
un-igrab()-able inodes without an equivalent synchronization, I think
this odd iterate_supers_type() just delays swapoff without buying any
guarantee: better just deleted to keep your patch simpler.
> return error;
> }
>
Thanks,
Hugh
On 08.04.2019 9:05, Hugh Dickins wrote:
> On Fri, 5 Apr 2019, Konstantin Khlebnikov wrote:
>> On 05.04.2019 5:12, Hugh Dickins wrote:
>>> Hi Alex, could you please give the patch below a try? It fixes a
>>> problem, but I'm not sure that it's your problem - please let us know.
>>>
>>> I've not yet written up the commit description, and this should end up
>>> as 4/4 in a series fixing several new swapoff issues: I'll wait to post
>>> the finished series until heard back from you.
>>>
>>> I did first try following the suggestion Konstantin had made back then,
>>> for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active).
>>>
>>> But it turned out to be difficult to get right in shmem_unuse(), because
>>> of the way that relies on the inode as a cursor in the list - problem
>>> when you've acquired an s_active reference, but fail to acquire inode
>>> reference, and cannot safely release the s_active reference while still
>>> holding the swaplist mutex.
>>>
>>> If VFS offered an isgrab(inode), like igrab() but acquiring s_active
>>> reference while holding i_lock, that would drop very easily into the
>>> current shmem_unuse() as a replacement there for igrab(). But the rest
>>> of the world has managed without that for years, so I'm disinclined to
>>> add it just for this. And the patch below seems good enough without it.
>>>
>>> Thanks,
>>> Hugh
>>>
>>> ---
>>>
>>> include/linux/shmem_fs.h | 1 +
>>> mm/shmem.c | 39 ++++++++++++++++++---------------------
>>> 2 files changed, 19 insertions(+), 21 deletions(-)
>>>
>>> --- 5.1-rc3/include/linux/shmem_fs.h 2019-03-17 16:18:15.181820820 -0700
>>> +++ linux/include/linux/shmem_fs.h 2019-04-04 16:18:08.193512968 -0700
>>> @@ -21,6 +21,7 @@ struct shmem_inode_info {
>>> struct list_head swaplist; /* chain of maybes on swap */
>>> struct shared_policy policy; /* NUMA memory alloc policy
>>> */
>>> struct simple_xattrs xattrs; /* list of xattrs */
>>> + atomic_t stop_eviction; /* hold when working on inode
>>> */
>>> struct inode vfs_inode;
>>> };
>>> --- 5.1-rc3/mm/shmem.c 2019-03-17 16:18:15.701823872 -0700
>>> +++ linux/mm/shmem.c 2019-04-04 16:18:08.193512968 -0700
>>> @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
>>> }
>>> spin_unlock(&sbinfo->shrinklist_lock);
>>> }
>>> - if (!list_empty(&info->swaplist)) {
>>> + while (!list_empty(&info->swaplist)) {
>>> + /* Wait while shmem_unuse() is scanning this inode...
>>> */
>>> + wait_var_event(&info->stop_eviction,
>>> + !atomic_read(&info->stop_eviction));
>>> mutex_lock(&shmem_swaplist_mutex);
>>> list_del_init(&info->swaplist);
>>> + /* ...but beware of the race if we peeked too early
>>> */
>>> + if (!atomic_read(&info->stop_eviction))
>>> + list_del_init(&info->swaplist);
>>> mutex_unlock(&shmem_swaplist_mutex);
>>> }
>>> }
>>> @@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool
>>> unsigned long *fs_pages_to_unuse)
>>> {
>>> struct shmem_inode_info *info, *next;
>>> - struct inode *inode;
>>> - struct inode *prev_inode = NULL;
>>> int error = 0;
>>> if (list_empty(&shmem_swaplist))
>>> return 0;
>>> mutex_lock(&shmem_swaplist_mutex);
>>> -
>>> - /*
>>> - * The extra refcount on the inode is necessary to safely dereference
>>> - * p->next after re-acquiring the lock. New shmem inodes with swap
>>> - * get added to the end of the list and we will scan them all.
>>> - */
>>> list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
>>> if (!info->swapped) {
>>> list_del_init(&info->swaplist);
>>> continue;
>>> }
>>> -
>>> - inode = igrab(&info->vfs_inode);
>>> - if (!inode)
>>> - continue;
>>> -
>>> + /*
>>> + * Drop the swaplist mutex while searching the inode for
>>> swap;
>>> + * but before doing so, make sure shmem_evict_inode() will
>>> not
>>> + * remove placeholder inode from swaplist, nor let it be
>>> freed
>>> + * (igrab() would protect from unlink, but not from unmount).
>>> + */
>>> + atomic_inc(&info->stop_eviction);
>>> mutex_unlock(&shmem_swaplist_mutex);
>>> - if (prev_inode)
>>> - iput(prev_inode);
>>> - prev_inode = inode;
>> This seems too ad hoc solution.
>
> I see what you mean by "ad hoc", but disagree with "too" ad hoc:
> it's an appropriate solution, and a general one - I didn't invent it
> for this, but for the huge tmpfs recoveries work items four years ago;
> just changed the name from "info->recoveries" to "info->stop_eviction"
> to let it be generalized to this swapoff case.
>
> I prefer mine, since it simplifies shmem_unuse() (no igrab!), and has
> the nice (but admittedly not essential) property of letting swapoff
> proceed without delay and without unnecessary locking on unmounting
> filesystems and evicting inodes. (Would I prefer to use the s_umount
> technique for my recoveries case? I think not.) >
> But yours should work too, with a slight change - see comments below,
> where I've inlined yours. I'd better get on and post my four fixes
> tomorrow, whether or not they fix Alex's case; then if people prefer
> yours to my 4/4, yours can be swapped in instead.
>
Ok. But both swapoff and tmpfs umount does not look like
operations that should be concurrent by any cost.
>> shmem: fix race between shmem_unuse and umount
>>
>> From: Konstantin Khlebnikov <[email protected]>
>>
>> Function shmem_unuse could race with generic_shutdown_super.
>> Inode reference is not enough for preventing umount and freeing superblock.
>>
>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>> ---
>> mm/shmem.c | 24 +++++++++++++++++++++++-
>> 1 file changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index b3db3779a30a..2018a9a96bb7 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1218,6 +1218,10 @@ static int shmem_unuse_inode(struct inode *inode, unsigned int type,
>> return ret;
>> }
>>
>> +static void shmem_synchronize_umount(struct super_block *sb, void *arg)
>> +{
>> +}
>> +
>
> I think this can go away, see below.
>
>> /*
>> * Read all the shared memory data that resides in the swap
>> * device 'type' back into memory, so the swap device can be
>> @@ -1229,6 +1233,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
>> struct shmem_inode_info *info, *next;
>> struct inode *inode;
>> struct inode *prev_inode = NULL;
>> + struct super_block *sb;
>> int error = 0;
>>
>> if (list_empty(&shmem_swaplist))
>> @@ -1247,9 +1252,22 @@ int shmem_unuse(unsigned int type, bool frontswap,
>> continue;
>> }
>>
>> + /*
>> + * Lock superblock to prevent umount and freeing it under us.
>> + * If umount in progress it will free swap enties.
>> + *
>> + * Must be done before grabbing inode reference, otherwise
>> + * generic_shutdown_super() will complain about busy inodes.
>> + */
>> + sb = info->vfs_inode.i_sb;
>> + if (!trylock_super(sb))
>
> Right, trylock important there.
>
>> + continue;
>> +
>> inode = igrab(&info->vfs_inode);
>> - if (!inode)
>> + if (!inode) {
>> + up_read(&sb->s_umount);
>
> Yes, that indeed avoids the difficulty I had with when to call
> deactivate_super(), that put me off trying to use s_active.
>
>> continue;
>> + }
>>
>> mutex_unlock(&shmem_swaplist_mutex);
>> if (prev_inode)
>> @@ -1258,6 +1276,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
>>
>> error = shmem_unuse_inode(inode, type, frontswap,
>> fs_pages_to_unuse);
>> + up_read(&sb->s_umount);
>
> No, not here. I think you have to note prev_sb, and then only
> up_read(&prev_sb->s_umount) after each iput(prev_inode): otherwise
> there's still a risk of "Self-destruct in 5 seconds", isn't there?
Oh yes. So, this code have to swap sb locks above with this monster
if (sb != info->vfs_inode.i_sb) {
if (sb)
up_read(&sb->s_umount);
sb = NULL;
if (!trylock_super(info->vfs_inode.i_sb))
continue;
sb = info->vfs_inode.i_sb
}
Locking shmem_swaplist_mutex under s_umount should be fine.
Also I looking into idea of treating swapoff like reverse-writeback:
-> iterate over superblocks
-> lock s_umount with normal down_read
-> iterate over inodes
-> iterate over inode tags
-> ...
Whole code will be more natural in this way.
>
>> cond_resched();
>>
>> mutex_lock(&shmem_swaplist_mutex);
>> @@ -1272,6 +1291,9 @@ int shmem_unuse(unsigned int type, bool frontswap,
>> if (prev_inode)
>> iput(prev_inode);
>>
>> + /* Wait for umounts, this grabs s_umount for each superblock. */
>> + iterate_supers_type(&shmem_fs_type, shmem_synchronize_umount, NULL);
>> +
>
> I guess that's an attempt to compensate for the somewhat unsatisfactory
> trylock above (bearing in mind the SWAP_UNUSE_MAX_TRIES 3, but I remove
> that in my 2/4). Nice idea, and if it had the effect of never needing to
> retry shmem_unuse(), I'd say yes; but since you're still passing over
> un-igrab()-able inodes without an equivalent synchronization, I think
> this odd iterate_supers_type() just delays swapoff without buying any
> guarantee: better just deleted to keep your patch simpler.
Yep, robust algorithm is better than try-3-times-and-give-up =)
(could hide bugs for ages)
I suppose your solution will wait for wakeup from shmem_evict_inode()?
That should work. In more general design this could be something like
__wait_on_freeing_inode(), but with killable wait.
>
>> return error;
>> }
>>
>
> Thanks,
> Hugh
>
On Mon, 8 Apr 2019, Konstantin Khlebnikov wrote:
>
> I suppose your solution will wait for wakeup from shmem_evict_inode()?
No, it's the other way round: shmem_unuse() gets on with its work without
delay, shmem_evict_inode() waits until the stop_eviction count has gone
down to zero, saying nobody else is at work on the inode.
Waiting in shmem_evict_inode() might be more worrying, if it weren't
already packed full with lock_page()s. And less attractive with the old
quadratic style of swapoff, when shmem_evict_inode() would have freed
the inode's swap much more efficiently than swapoff could then manage.
Hugh