Since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
improved unlock_page(), it has become more noticeable how cow_user_page()
in a kernel with CONFIG_DMA_API_DEBUG=y can create and suffer from heavy
contention on DMA debug's radix_lock in debug_dma_assert_idle().
It is only doing a lookup: use rcu_read_lock() and rcu_read_unlock()
instead; though that does require the static ents[] to be moved onstack...
...but, hold on, isn't that radix_tree_gang_lookup() and loop doing quite
the wrong thing: searching CACHELINES_PER_PAGE entries for an exact match
with the first cacheline of the page in question? radix_tree_gang_lookup()
is the right tool for the job, but we need nothing more than to check the
first entry it can find, reporting if that falls anywhere within the page.
(Is RCU safe here? As safe as using the spinlock was. The entries are
never freed, so don't need to be freed by RCU. They may be reused, and
there is a faint chance of a race, with an offending entry reused while
printing its error info; but the spinlock did not prevent that either,
and I agree that it's not worth worrying about.)
Fixes: 3b7a6418c749 ("dma debug: account for cachelines and read-only mappings in overlap tracking")
Signed-off-by: Hugh Dickins <[email protected]>
---
kernel/dma/debug.c | 27 +++++++++------------------
1 file changed, 9 insertions(+), 18 deletions(-)
--- v5.9-rc/kernel/dma/debug.c 2020-08-05 18:17:57.544203766 -0700
+++ linux/kernel/dma/debug.c 2020-08-12 19:53:33.159070245 -0700
@@ -565,11 +565,8 @@ static void active_cacheline_remove(stru
*/
void debug_dma_assert_idle(struct page *page)
{
- static struct dma_debug_entry *ents[CACHELINES_PER_PAGE];
- struct dma_debug_entry *entry = NULL;
- void **results = (void **) &ents;
- unsigned int nents, i;
- unsigned long flags;
+ struct dma_debug_entry *entry;
+ unsigned long pfn;
phys_addr_t cln;
if (dma_debug_disabled())
@@ -578,20 +575,14 @@ void debug_dma_assert_idle(struct page *
if (!page)
return;
- cln = (phys_addr_t) page_to_pfn(page) << CACHELINE_PER_PAGE_SHIFT;
- spin_lock_irqsave(&radix_lock, flags);
- nents = radix_tree_gang_lookup(&dma_active_cacheline, results, cln,
- CACHELINES_PER_PAGE);
- for (i = 0; i < nents; i++) {
- phys_addr_t ent_cln = to_cacheline_number(ents[i]);
+ pfn = page_to_pfn(page);
+ cln = (phys_addr_t) pfn << CACHELINE_PER_PAGE_SHIFT;
- if (ent_cln == cln) {
- entry = ents[i];
- break;
- } else if (ent_cln >= cln + CACHELINES_PER_PAGE)
- break;
- }
- spin_unlock_irqrestore(&radix_lock, flags);
+ rcu_read_lock();
+ if (!radix_tree_gang_lookup(&dma_active_cacheline, (void **) &entry,
+ cln, 1) || entry->pfn != pfn)
+ entry = NULL;
+ rcu_read_unlock();
if (!entry)
return;
On Wed, Aug 12, 2020 at 8:17 PM Hugh Dickins <[email protected]> wrote:
>
> Since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
> improved unlock_page(), it has become more noticeable how cow_user_page()
> in a kernel with CONFIG_DMA_API_DEBUG=y can create and suffer from heavy
> contention on DMA debug's radix_lock in debug_dma_assert_idle().
Ooh.
Yeah, that's ridiculously expensive, and serializes things for no good reason.
Your patch looks obviously correct to me (Christoph?), but it also
makes me go "why are we doing this in the first place"?
Because it looks to me like
(a) the debug check is wrong
(b) this is left-over from early debugging
In particular, I don't see why we couldn't do a COW on a page that is
under writeback at the same time. We're not changing the page that is
doing DMA.
In fact, the whole "COW with DMA" makes me feel like the real bug may
have been due that whole "ambiguous COW" thing, which was fixed in
17839856fd58 ("gup: document and work around "COW can break either
way" issue")
That debug thing goes back almost 7 years, and I don't think it has
caught anything in those seven years, but I could be wrong.
The commit that adds it does talk about a bug, but that code was
removed entirely eventually. And google shows no hits for
debug_dma_assert_idle() since - until your email.
So my gut feel is that we should remove the check entirely, although
your patch does seem like a big improvement.
Christoph?
(And Dan too, of course, in case he happens to be relaxing in front of
the computer away from a newborn baby ;)
Linus
On Thu, Aug 13, 2020 at 12:03 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Aug 12, 2020 at 8:17 PM Hugh Dickins <[email protected]> wrote:
> >
> > Since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
> > improved unlock_page(), it has become more noticeable how cow_user_page()
> > in a kernel with CONFIG_DMA_API_DEBUG=y can create and suffer from heavy
> > contention on DMA debug's radix_lock in debug_dma_assert_idle().
>
> Ooh.
>
> Yeah, that's ridiculously expensive, and serializes things for no good reason.
>
> Your patch looks obviously correct to me (Christoph?), but it also
> makes me go "why are we doing this in the first place"?
>
> Because it looks to me like
> (a) the debug check is wrong
> (b) this is left-over from early debugging
>
> In particular, I don't see why we couldn't do a COW on a page that is
> under writeback at the same time. We're not changing the page that is
> doing DMA.
>
> In fact, the whole "COW with DMA" makes me feel like the real bug may
> have been due that whole "ambiguous COW" thing, which was fixed in
> 17839856fd58 ("gup: document and work around "COW can break either
> way" issue")
>
> That debug thing goes back almost 7 years, and I don't think it has
> caught anything in those seven years, but I could be wrong.
>
> The commit that adds it does talk about a bug, but that code was
> removed entirely eventually. And google shows no hits for
> debug_dma_assert_idle() since - until your email.
>
> So my gut feel is that we should remove the check entirely, although
> your patch does seem like a big improvement.
>
> Christoph?
>
> (And Dan too, of course, in case he happens to be relaxing in front of
> the computer away from a newborn baby ;)
>
I can at least confirm that it has not caught anything in a long while
except a false positive that needed a fix up.
https://lore.kernel.org/lkml/CAPcyv4hy_nNe8G0o8sMrz9A8HcdRzAuKgXmvdjKusAAA3Fow4g@mail.gmail.com/
Part of me says it's not doing anything worthwhile upstream, but I
wonder if it is keeping some people from submitting patches that play
these page reference shenanigans? I know they're out there. The land
of gup and truncate is where questionable kernel changes go to die.
Outside of that, Hugh's patch looks like a definite improvement so I'd
be inclined to run with that, but rip the whole facility out at the
next sign of a false positive.
On Thu, Aug 13, 2020 at 12:02:41PM -0700, Linus Torvalds wrote:
> Yeah, that's ridiculously expensive, and serializes things for no good reason.
>
> Your patch looks obviously correct to me (Christoph?),
It also looks correct to me.
> but it also
> makes me go "why are we doing this in the first place"?
>
> Because it looks to me like
> (a) the debug check is wrong
> (b) this is left-over from early debugging
>
> In particular, I don't see why we couldn't do a COW on a page that is
> under writeback at the same time. We're not changing the page that is
> doing DMA.
Yes. We don't need to check for a DMA to the device, but a DMA from
the device while under DMA obviously is bogus. But then again you'd
need to try really hard to do that.
> In fact, the whole "COW with DMA" makes me feel like the real bug may
> have been due that whole "ambiguous COW" thing, which was fixed in
> 17839856fd58 ("gup: document and work around "COW can break either
> way" issue")
>
> That debug thing goes back almost 7 years, and I don't think it has
> caught anything in those seven years, but I could be wrong.
>
> The commit that adds it does talk about a bug, but that code was
> removed entirely eventually. And google shows no hits for
> debug_dma_assert_idle() since - until your email.
>
> So my gut feel is that we should remove the check entirely, although
> your patch does seem like a big improvement.
>
> Christoph?
The whole thing predates my involvement with the code, but I defintively
think the patch from Hugh is a major improvement. But I would also
have no problem with just removing it entirely.
On Thu, Aug 13, 2020 at 10:42 PM Christoph Hellwig <[email protected]> wrote:
>
> The whole thing predates my involvement with the code, but I defintively
> think the patch from Hugh is a major improvement. But I would also
> have no problem with just removing it entirely.
I decided to just do both, since neither you nor Dan seemed to really object.
I applied Hugh's RCU read locking patch as a clear improvement, and
then I did a second patch that just removed this function entirely.
That sounds a bit odd, perhaps, but in case people decide to resurrect
the debugging code, I didn't want us to lose sight of Hugh's
improvement just because I then decided that we might as well go one
step further and just remove it entirely.
And the only real reason I care is that this whole COW and page lock
thing has showed up lately, and I like removing code.
I'm _very_ tempted to just apply my COW simplification patch that gets
rid of all the complex try-to-share cases entirely (and would also
obviate the whole forced-cow patch). I suspect it would effectively
remove almost all of the [un[lock_page() bottlenecks entirely, but
that code has decades of history and I suspect it's a bit too drastic
wrt KSM and the swap cache pages.
It would be lovely if the main source of page locking would really be
about just IO, but the page lock has also become the thing that
serializes almost everything related to page state. Which is why you
find it in contexts that are really not IO-related at all (not just
COW - page migration is the other one that has shown up a lot under
"heavy CPU loads" without really necessarily any IO component to it).
Linus
On Fri, 14 Aug 2020, Linus Torvalds wrote:
> On Thu, Aug 13, 2020 at 10:42 PM Christoph Hellwig <[email protected]> wrote:
> >
> > The whole thing predates my involvement with the code, but I defintively
> > think the patch from Hugh is a major improvement. But I would also
> > have no problem with just removing it entirely.
>
> I decided to just do both, since neither you nor Dan seemed to really object.
>
> I applied Hugh's RCU read locking patch as a clear improvement, and
> then I did a second patch that just removed this function entirely.
> That sounds a bit odd, perhaps, but in case people decide to resurrect
> the debugging code, I didn't want us to lose sight of Hugh's
> improvement just because I then decided that we might as well go one
> step further and just remove it entirely.
That's ideal, thanks - exactly the sequence I was hoping for.
(Another shortcoming in debug_dma_assert_idle(), that I hadn't wanted
to distract us by mentioning, is that it assumed that the mapping is
contained within one small page, whereas I believe one or more of the
DMA mapping functions take a size_t argument, that could in theory span
small pages - I guess more plausible inside a compound page; yet it
looked like only an initial entry would be put into the radix-tree.)
>
> And the only real reason I care is that this whole COW and page lock
> thing has showed up lately, and I like removing code.
>
> I'm _very_ tempted to just apply my COW simplification patch that gets
> rid of all the complex try-to-share cases entirely (and would also
> obviate the whole forced-cow patch). I suspect it would effectively
> remove almost all of the [un[lock_page() bottlenecks entirely, but
> that code has decades of history and I suspect it's a bit too drastic
> wrt KSM and the swap cache pages.
Yes, you're right to hold back.
I'd been looking there too (but backed off while speeding up the
fork was causing the "Hugh load" to "fail": it's the exit that now
wants speeding up, to please that test). I think it could well avoid
getting into page locking when mapcount is quickly seen to be high
(> 1? > 2? > bigger? I never did the logic), but the page locking
becomes important when mapcount looks low, yet swap might be involved.
We used to rely on page count there, and on trylock_page() only; but
there was at least one user whose app went wrong when occasionally we
COWed the page, just because something else momentarily took a reference
to it, or locked it. Around 2006, bug report from 2004: I did look up
the history a week ago, but was interrupted before taking notes.
>
> It would be lovely if the main source of page locking would really be
> about just IO, but the page lock has also become the thing that
> serializes almost everything related to page state. Which is why you
> find it in contexts that are really not IO-related at all (not just
> COW - page migration is the other one that has shown up a lot under
> "heavy CPU loads" without really necessarily any IO component to it).
>
> Linus
On Fri, Aug 14, 2020 at 5:26 PM Hugh Dickins <[email protected]> wrote:
>
> We used to rely on page count there, and on trylock_page() only; but
> there was at least one user whose app went wrong when occasionally we
> COWed the page, just because something else momentarily took a reference
> to it, or locked it. Around 2006, bug report from 2004: I did look up
> the history a week ago, but was interrupted before taking notes.
I actually think you may be talking about the exact problem that that
debug patch from Dan was originally created for:
0abdd7a81b7e dma-debug: introduce debug_dma_assert_idle()
77873803363c net_dma: mark broken
and your memory sounds exactly like that net_dma case (and the timing
matches roughly too - the NET_DMA code was merged in 2006, but I think
people had been playing trial games with it before that).
IOW, net_dma was horribly broken, and just couldn't deal with COW
because it did things wrong.
The thing is, doing extra COW's really shouldn't matter in _any_
half-way correct situation. There's a few cases:
- user space writing to it, so we COW.
This is the "simple" case that is obvious and we've always done the
same thing. User space will get the new copy, and there's no possible
situation when that can be wrong.
- get_user_pages() for reading.
This is the one we actually used to get wrong, and when another
user *didn't* cow, the data that was read might not match what the
original get_uiser_pages() case expected.
But in this case, the bug only happened when we didn't cow
aggressively enough.
- get_user_pages() for writing
This is another 'simple" case, because it does the COW at
get_user_pages() time and gets it's own copy (which is also installed
in the thread that does the GUP, of course, so a subsequent fork an
danother write can obviously cause *further* COW action).
But in no case should an extra COW matter. Except if somebody uses
get_user_pages() to write to the page, and the COW "hides" that write
by giving a new copy to whoever expected to see it, but that's exactly
the case that Dan's patch was supposed to notice.
And since it never triggered outside of that invalid net_dma case, I
don't think any other case really ever existed.
Yes, I can well imagine that some people loved the concept of that TCP
receive copy offload, but it really was broken, and was removed
entirely by Dan in commit 7bced397510a ("net_dma: simple removal") a
year after being marked broken (the author date makes it look like
it's just a couple of weeks after being marked broken, but the commit
date for that removal is September 2014).
So I don't think that the trylock and checking page counts is a
correctness issue.
It had better not be, because anybody that writes to a shared-cow page
without breaking COW is simply broken.
No, I really think that the real worry about doing more aggressive
copying is that it doesn't steal back the KSM page or the swap cache
page, so it will leave those pages around, and while they should then
be really easy for the VM to reclaim, I really worry that we have a
couple of decades of VM reclaim tuning with that swap cache reuse
behavior (KSM, not so much).
And while it works fine on my machine, I currently have 40GB or RAM
free, because honestly, the stuff I do doesn't need all that much
memory, and I ridiculously overspecced my new machine RAM'wise. So
nothing I will do would show any problems.
Linus