On Fri, Oct 30, 2020 at 08:51:21PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 06:52:50PM -0400, Peter Xu wrote:
...
>
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index c48f8df6e50268..294c2c3c4fe00d 100644
> > > +++ b/mm/memory.c
> > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> > > 0, src_vma, src_mm, addr, end);
> > > mmu_notifier_invalidate_range_start(&range);
> > > + /*
> > > + * The read side doesn't spin, it goes to the mmap_lock, so the
> > > + * raw version is used to avoid disabling preemption here
> > > + */
> > > + mmap_assert_write_locked(src_mm);
> > > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
> >
> > Would raw_write_seqcount_begin() be better here?
>
> Hum..
>
> I felt no because it had the preempt stuff added into it, however it
> would work - __seqcount_lock_preemptible() == false for the seqcount_t
> case (see below)
>
> Looking more closely, maybe the right API to pick is
> write_seqcount_t_begin() and write_seqcount_t_end() ??
>
No, that's not the right API: it is also internal to seqlock.h.
Please stick with the official exported API: raw_write_seqcount_begin().
It should satisfy your needs, and the raw_*() variant is created exactly
for contexts wishing to avoid the lockdep checks (e.g. NMI handlers
cannot invoke lockdep, etc.)
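Concretely, a minimal sketch of the hunk above using only that exported API
(illustrative only; the matching end call sits at the bottom of
copy_page_range() in the actual patch):

    mmap_assert_write_locked(src_mm);
    /*
     * raw_write_seqcount_begin() on a plain seqcount_t neither disables
     * preemption (__seqcount_lock_preemptible() is false here) nor runs
     * the lockdep checks, so it keeps the intended behaviour while
     * staying within the exported API.
     */
    raw_write_seqcount_begin(&src_mm->write_protect_seq);

    /* ... copy the page tables ... */

    raw_write_seqcount_end(&src_mm->write_protect_seq);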
> However, no idea what the intention of the '*_seqcount_t_*' family is
> - it only seems to be used to implement the seqlock..
>
Exactly. '*_seqcount_t_*' is a seqlock.h implementation detail, and it
has _zero_ relevance to what is discussed in this thread actually.
...
> Ahmed explained in commit 8117ab508f that the reason the seqcount_t
> write side disables preemption is that it can livelock RT kernels if
> the read side spins after preempting the write side. E.g. look at
> how __read_seqcount_begin() is implemented:
>
> while ((seq = __seqcount_sequence(s)) & 1) \
> cpu_relax(); \
>
> However, in this patch, we don't spin on the read side.
>
> If the read side collides with a writer it immediately falls back to
> the mmap_lock, which is a sleeping lock, so it will sort itself out
> properly even if the writer was preempted.
>
Correct.
Thanks,
--
Ahmed Darwish
Linutronix GmbH
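For reference, the non-spinning read side described above looks roughly like
this in the fast-gup path (a sketch of the pattern only: the surrounding
function is simplified, and the -EAGAIN fallback to the sleeping mmap_lock
path is assumed from the rest of the series):

    unsigned int seq;

    /*
     * Sample the sequence once. If a fork() is mid-flight (odd count),
     * do not spin; bail out so the caller falls back to the sleeping
     * mmap_lock instead.
     */
    seq = raw_read_seqcount(&mm->write_protect_seq);
    if (seq & 1)
            return -EAGAIN;

    /* ... lockless page-table walk, pinning pages ... */

    /* A writer raced with us: undo the pins and fall back, no spinning. */
    if (read_seqcount_retry(&mm->write_protect_seq, seq)) {
            unpin_user_pages(pages, nr_pinned);
            return -EAGAIN;
    }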
On Tue, Nov 03, 2020 at 01:17:12AM +0100, Ahmed S. Darwish wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index c48f8df6e50268..294c2c3c4fe00d 100644
> > > > +++ b/mm/memory.c
> > > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > > mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> > > > 0, src_vma, src_mm, addr, end);
> > > > mmu_notifier_invalidate_range_start(&range);
> > > > + /*
> > > > + * The read side doesn't spin, it goes to the mmap_lock, so the
> > > > + * raw version is used to avoid disabling preemption here
> > > > + */
> > > > + mmap_assert_write_locked(src_mm);
> > > > + raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
> > >
> > > Would raw_write_seqcount_begin() be better here?
> >
> > Hum..
> >
> > I felt no because it had the preempt stuff added into it, however it
> > would work - __seqcount_lock_preemptible() == false for the seqcount_t
> > case (see below)
> >
> > Looking more closely, maybe the right API to pick is
> > write_seqcount_t_begin() and write_seqcount_t_end() ??
> >
>
> No, that's not the right API: it is also internal to seqlock.h.
>
> Please stick with the official exported API: raw_write_seqcount_begin().
>
> It should satisfy your needs, and the raw_*() variant is created exactly
> for contexts wishing to avoid the lockdep checks (e.g. NMI handlers
> cannot invoke lockdep, etc.)
Ahmed, Jason, feel free to correct me - but I feel like what Jason wanted here
is indeed the version that does not require disabling preemption, a.k.a.
write_seqcount_t_begin() and write_seqcount_t_end(), since it's preempt-safe as
long as the read side does not retry. I'm just not sure whether there's a
non-"*_t_*" version of it.

Another idea is that maybe we can use the raw_write_seqcount_begin() version,
but move it from copy_page_range() into copy_pte_range(). I think that would
not affect normal (non-RT) Linux regarding preemption, since by the time we
reach the pte level preemption is already disabled anyway (by taking the
pgtable spinlock). But there could be extra overhead, since we would take the
write seqcount much more often (per pte table rather than once per fork(), so
there may be some perf impact). It would also mean an extra, real
preempt_disable() for the future RT code if that lands some day, since
rt_spin_lock (which would be used here, iiuc) should not need to disable
preemption. So it seems still better to do it in copy_page_range() as Jason
proposed.
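In code, that alternative would look roughly like this (a hypothetical sketch
of the copy_pte_range() placement weighed above, i.e. the variant this
paragraph argues against):

    /*
     * Hypothetical placement inside copy_pte_range(): the pgtable
     * spinlock below already disables preemption on non-RT kernels,
     * but the seqcount is now bumped once per pte table instead of
     * once per fork().
     */
    dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
    raw_write_seqcount_begin(&src_mm->write_protect_seq);

    /* ... copy the ptes covered by this table ... */

    raw_write_seqcount_end(&src_mm->write_protect_seq);
    pte_unmap_unlock(dst_pte, dst_ptl);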
Thanks,
--
Peter Xu