On Fri, Jan 18, 2019 at 5:12 PM Curtis Malainey <[email protected]> wrote:
>
>
>
> On Fri, Jan 18, 2019 at 3:02 PM Pierre-Louis Bossart <[email protected]> wrote:
>>
>>
>> On 1/15/19 3:16 PM, Pierre-Louis Bossart wrote:
>> >
>> >>> Beyond the fact that the platform_name seems to be totally useless,
>> >>> additional tests show that the patch ('ASoC: soc-core: defer card probe
>> >>> until all component is added to list') adds a new restriction which
>> >>> contradicts existing error checks.
>> >>>
>> >>> None of the Intel machine drivers set the dailink "cpu_name" field
>> >>> but use
>> >>> the "cpu_dai_name" field instead. This was perfectly legit as
>> >>> documented by
>> >>> the code at the end of soc_init_dai_link()
>> >> This should be fixed by the patch
>> >> "ASoC: core: Don't defer probe on optional, NULL components" which Mark
>> >> already applied to his tree. See
>> >> http://mailman.alsa-project.org/pipermail/alsa-devel/2019-January/144323.html
>> >>
>> >
>> > Ah yes, I missed this patch while I was debugging. Indeed this fixes
>> > the problem and my devices work again with Mark's for-next branch.
>> > Thanks Matthias!
>>
>> This PROBE_DEFER support actually breaks the topology override that
>> we've been relying on for SOF (and which has been in Mark's branch for
>> some time now). This override helps us reuse machine drivers between
>> legacy and SOF-based solutions.
>>
>> With the current code, the tests in soc_register_card() complain that
>> the platform_name can't be tied to a component and stop the card
>> registration, but that's mainly because the tests are done before the
>> topology overrides are done in soc_check_tplg_fes(). Moving
>> soc_check_tplg_fes() from soc_instantiate_card() to an earlier time in
>> soc_register_card() works around the problem but looks quite invasive
>> (mutex lock, etc).
>>
>> There is also a second problem where we seem to have a memory management
>> issue root caused to the change in snd_soc_init_platform() added by
>> 09ac6a817bd6 ('ASoC: soc-core: fix init platform memory handling')
>>
>> The code does this
>>
>> static int snd_soc_init_platform(struct snd_soc_card *card,
>>                                  struct snd_soc_dai_link *dai_link)
>> {
>>         struct snd_soc_dai_link_component *platform = dai_link->platform;
>>
>>         /* convert Legacy platform link */
>>         if (!platform || dai_link->legacy_platform) {
>>                 platform = devm_kzalloc(card->dev,
>>                                 sizeof(struct snd_soc_dai_link_component),
>>                                 GFP_KERNEL);
>>                 if (!platform)
>>                         return -ENOMEM;
>>
>>                 dai_link->platform = platform;
>>                 dai_link->legacy_platform = 1;
>>
>> This last assignment guarantees that memory will be allocated every time
>> this function is called, and whatever overrides are done later will
>> themselves be overridden by the new allocation. I am not sure what the
>> intent was here, Curtis can you please double-check?
>>
The issue was that we were already seeing a memory corruption bug with
that function on AMD Chromebooks (not observed on Intel). I was
testing some SOF integrations and was seeing this in the kernel logs.
I had Dylan verify my logic before I sent the patch because it took so
long to identify the bug, which was traced to the patch that introduced
soc_init_platform.

[ 10.922112] cz-da7219-max98357a AMD7219:00: ASoC: CPU DAI designware-i2s.1.auto not registered
[ 10.922122] cz-da7219-max98357a AMD7219:00: devm_snd_soc_register_card(acpd7219m98357) failed: -517
[ 11.001411] cz-da7219-max98357a AMD7219:00: ASoC: Both platform name/of_node are set for amd-max98357-play
[ 11.001423] cz-da7219-max98357a AMD7219:00: ASoC: failed to init link amd-max98357-play
[ 11.001431] cz-da7219-max98357a AMD7219:00: devm_snd_soc_register_card(acpd7219m98357) failed: -22
[ 11.001577] cz-da7219-max98357a: probe of AMD7219:00 failed with error -22

of_node was never getting set, but the pointer was becoming populated
(outside of the probe call). This traced to the soc_init_platform function,
which was not reallocating memory on an EPROBE_DEFER even though the
memory was getting freed by devm. I am not very familiar with devm, but
my local maintainers say that it should be freeing the memory even on a
PROBE_DEFER.

The patch should mirror the memory behaviour in
snd_soc_init_multicodec, which also reallocates its memory on every
probe. I'm not sure how the patch is causing you to defer; is your
component list corrupt?

Sorry for the duplicate spam, forgot to send via plain text mode,
re-sending for the mailing list so it gets accepted.
>
>> Details, test code and logs are available here:
>> https://github.com/thesofproject/linux/issues/565
>>
>> Have a nice week-end everyone, that's it for me until Tuesday.
>>
>> -Pierre
>>
>>
>>
On Fri, Jan 18, 2019 at 05:15:32PM -0800, Curtis Malainey wrote:
> of_node was never getting set but the pointer was becoming populated
> (outside of the probe call) which traced to soc_init_platform function
> which was not reallocating memory on a EPROBE_DEFER even though it was
> getting freed by devm. I am not very familiar with devm but my local
> maintainers say that it should be freeing the memory even on a
> PROBE_DEFER.
Probe deferral is just like any other error from probe: any managed
resources will be unwound.
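
To make the unwinding concrete, here is a minimal userspace sketch, not
kernel code. It assumes a toy device whose small pool stands in for
devm-managed memory, and a link structure that outlives a failed probe.
Every name in it (device_model, managed_alloc, release_all, toy_probe) is
hypothetical; only the -517 return value (EPROBE_DEFER) is taken from the
log in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_EPROBE_DEFER 517	/* matches the -517 seen in the log */
#define TOY_ENOMEM       12
#define NUM_SLOTS        4

/* One managed allocation; in_use is cleared when the device unwinds. */
struct managed_slot {
	int in_use;
	char data[32];
};

/* Toy device: the pool stands in for memory tracked by devm. */
struct device_model {
	struct managed_slot slots[NUM_SLOTS];
};

/* Long-lived link, like an entry in a machine driver's dai_link table. */
struct toy_link {
	struct managed_slot *platform;
};

static struct managed_slot *managed_alloc(struct device_model *dev)
{
	assert(dev != NULL);
	for (size_t i = 0; i < NUM_SLOTS; i++) {
		if (!dev->slots[i].in_use) {
			dev->slots[i].in_use = 1;
			memset(dev->slots[i].data, 0, sizeof(dev->slots[i].data));
			return &dev->slots[i];
		}
	}
	return NULL;
}

/* Release everything the device owns, as happens on any probe error. */
static void release_all(struct device_model *dev)
{
	assert(dev != NULL);
	for (size_t i = 0; i < NUM_SLOTS; i++)
		dev->slots[i].in_use = 0;
}

/* Allocate only when the pointer is NULL, then succeed or defer. */
static int toy_probe(struct device_model *dev, struct toy_link *link,
		     int dependency_ready)
{
	if (!link->platform) {
		link->platform = managed_alloc(dev);
		if (!link->platform)
			return -TOY_ENOMEM;
	}
	if (!dependency_ready) {
		release_all(dev);	/* the link pointer is NOT reset */
		return -TOY_EPROBE_DEFER;
	}
	return 0;
}

/* Returns 1 if the link still points at released memory after a defer. */
static int stale_after_defer(void)
{
	struct device_model dev;
	struct toy_link link = { NULL };
	int ret;

	memset(&dev, 0, sizeof(dev));
	ret = toy_probe(&dev, &link, 0);

	return ret == -TOY_EPROBE_DEFER &&
	       link.platform != NULL && !link.platform->in_use;
}
```

In this model the next probe attempt would skip the allocation because the
pointer is non-NULL, which is the kind of stale state described above.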
> The issue was that we were seeing a memory corruption bug on an AMD
> chromebooks with that function already (not observed on Intel). I was
> testing some SOF integrations and was seeing this in the kernel logs.
> I had Dylan verify my logic before I sent the patch because it took so
> long to identify the bug and it was traced to the patch that introduce
> soc_init_platform.
>
> [ 10.922112] cz-da7219-max98357a AMD7219:00: ASoC: CPU DAI
> designware-i2s.1.auto not registered
> [ 10.922122] cz-da7219-max98357a AMD7219:00:
> devm_snd_soc_register_card(acpd7219m98357) failed: -517
> [ 11.001411] cz-da7219-max98357a AMD7219:00: ASoC: Both platform
> name/of_node are set for amd-max98357-play
> [ 11.001423] cz-da7219-max98357a AMD7219:00: ASoC: failed to init
> link amd-max98357-play
> [ 11.001431] cz-da7219-max98357a AMD7219:00:
> devm_snd_soc_register_card(acpd7219m98357) failed: -22
> [ 11.001577] cz-da7219-max98357a: probe of AMD7219:00 failed with error -22
>
> of_node was never getting set but the pointer was becoming populated
> (outside of the probe call) which traced to soc_init_platform function
> which was not reallocating memory on a EPROBE_DEFER even though it was
> getting freed by devm. I am not very familiar with devm but my local
> maintainers say that it should be freeing the memory even on a
> PROBE_DEFER.
> The patch should mirror the memory behaviour in
> snd_soc_init_multicodec which also reallocates its memory on every
> probe. I'm not sure how the patch is causing you to defer, is your
> component list corrupt?
>
> Sorry for the duplicate spam, forgot to send via plain text mode,
> re-sending for the mailing list so it gets accepted.
There is no defer issue with the Intel stuff, but we call this routine
multiple times:

snd_soc_register_card
--soc_init_dai_link
----snd_soc_init_platform
--snd_soc_bind_card
----snd_soc_instantiate_card
------soc_check_tplg_fes
--------snd_soc_init_platform << ALLOC1
--------soc_init_dai_link
----------snd_soc_init_platform << ALLOC2

Initially dai_link->legacy_platform is 0, so it gets set after the
first devm_kzalloc (ALLOC1), and after that we always allocate new memory
(ALLOC2). The end result is that whatever we set in soc_check_tplg_fes
is lost with the new/unnecessary alloc.

I would guess your solution is also a work-around: if devm_ effectively
freed the memory then the pointer would become NULL. Or maybe the
issue is that no one actually resets it.
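
The lost override can be reproduced with a small userspace model. This is
only a sketch: the structures are cut down to the fields discussed here,
calloc() stands in for devm_kzalloc(), the copy from the legacy
platform_name field is an assumption about what follows the quoted excerpt,
and the "sof-audio" name is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the kernel structures. */
struct link_component {
	const char *name;
};

struct dai_link {
	const char *platform_name;		/* legacy field */
	struct link_component *platform;	/* modern field */
	int legacy_platform;
};

/* Same condition as the quoted code: allocate when the pointer is NULL
 * or when the legacy flag is already set. */
static int init_platform(struct dai_link *link)
{
	struct link_component *platform;

	assert(link != NULL);
	platform = link->platform;

	if (!platform || link->legacy_platform) {
		platform = calloc(1, sizeof(*platform));
		if (!platform)
			return -1;
		link->platform = platform;
		link->legacy_platform = 1;
		/* assumed: the new component is filled from the legacy field */
		platform->name = link->platform_name;
	}
	return 0;
}

/* Returns 1 if an override written between the two calls survives,
 * 0 if it is lost, -1 on allocation failure. */
static int override_survives(void)
{
	struct dai_link link = { .platform_name = "legacy-platform" };
	struct link_component *first;
	int survived;

	if (init_platform(&link))		/* ALLOC1 */
		return -1;
	first = link.platform;
	link.platform->name = "sof-audio";	/* topology override */

	if (init_platform(&link)) {		/* ALLOC2 */
		free(first);
		return -1;
	}
	survived = strcmp(link.platform->name, "sof-audio") == 0;

	if (link.platform != first)
		free(link.platform);
	free(first);
	return survived;
}
```

With the flag set by the first call, the second call always takes the
allocation branch, so override_survives() reports the override as lost.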
Curtis Malainey | Software Engineer | [email protected] | 650-898-3849
On Wed, Jan 23, 2019 at 4:11 AM Pierre-Louis Bossart
<[email protected]> wrote:
>
>
>
> There is no defer issue with the intel stuff, but we call this routine
> multiple times
>
> snd_soc_register_card
> --soc_init_dai_link
> ----snd_soc_init_platform
> --snd_soc_bind_card
> ----snd_soc_instantiate_card
> ------soc_check_tplg_fes
> --------snd_soc_init_platform << ALLOC1
> --------soc_init_dai_link
> ----------snd_soc_init_platform << ALLOC2
>
Ah, that explains it: in my testing I didn't have the patch that
brought in the call from within soc_check_tplg_fes.
>
> Initially dai_link->legacy_platform is 0, so gets set after the first
> first devm_kzalloc (ALLOC1) and after that we always allocate new memory
> (ALLOC2). The end result is that whatever we set in soc_check_tplg_fes
> is lost with the new/unnecessary alloc.
>
> I would guess your solution is also a work-around, if devm_ effectively
> freed the memory then the pointer would become NULL. Or may that's the
> issue is that no one actually resets it.
>
>
Yes, it's a workaround to fix the memory issue. If you set the
platform in the machine driver, the code will ignore it and not reset
it. That being said, it is not a foolproof workaround and a better
solution is definitely needed. We could clean up the pointers
in soc_instantiate_card based on the flag being set. That way we only
reallocate on a NULL pointer like we used to, but still don't affect
statically allocated memory. I will draft a patch, test it on the AMD
device, and reply to this thread later with it. Pierre, can you test it
as well?

I am curious why soc_check_tplg_fes is calling snd_soc_init_platform.
It should have already been called earlier, in soc_init_dai_link at
the beginning of snd_soc_register_card, so the memory should already be
initialized. Unless I am missing somewhere where links are getting
added between the calls.
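
A rough userspace sketch of that cleanup idea follows. It is not the patch
itself: the structures are simplified, free() stands in for the devm
unwind, cleanup_platform() is a hypothetical name, and clearing the flag
during cleanup is my assumption rather than something settled above.

```c
#include <assert.h>
#include <stdlib.h>

struct link_component {
	const char *name;
};

struct dai_link {
	struct link_component *platform;
	int legacy_platform;	/* set when the core allocated ->platform */
};

/* Allocate only when nothing is attached yet. */
static int init_platform(struct dai_link *link)
{
	assert(link != NULL);
	if (!link->platform) {
		link->platform = calloc(1, sizeof(*link->platform));
		if (!link->platform)
			return -1;
		link->legacy_platform = 1;
	}
	return 0;
}

/* Error-path cleanup: drop only pointers the core allocated itself,
 * so a component supplied by the machine driver is never touched. */
static void cleanup_platform(struct dai_link *link)
{
	assert(link != NULL);
	if (link->legacy_platform) {
		free(link->platform);
		link->platform = NULL;
		link->legacy_platform = 0;
	}
}

/* Returns 1 if cleanup resets core-owned pointers, leaves driver-owned
 * ones alone, and a later init allocates again; -1 on allocation failure. */
static int cleanup_behaves(void)
{
	static struct link_component driver_owned = { "driver-platform" };
	struct dai_link legacy = { NULL, 0 };
	struct dai_link modern = { &driver_owned, 0 };
	int ok;

	if (init_platform(&legacy) || init_platform(&modern))
		return -1;

	cleanup_platform(&legacy);
	cleanup_platform(&modern);
	ok = legacy.platform == NULL && modern.platform == &driver_owned;

	if (init_platform(&legacy))	/* second probe attempt */
		return -1;
	ok = ok && legacy.platform != NULL && legacy.legacy_platform == 1;

	cleanup_platform(&legacy);
	return ok;
}
```

The point of keying the cleanup on the flag is that the allocation check
can go back to a plain NULL test without freeing memory the core never
owned.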
On 1/22/19 7:36 PM, Curtis Malainey wrote:
> Curtis Malainey | Software Engineer | [email protected] | 650-898-3849
>
>
> On Wed, Jan 23, 2019 at 4:11 AM Pierre-Louis Bossart
> <[email protected]> wrote:
>>
>> There is no defer issue with the intel stuff, but we call this routine
>> multiple times
>>
>> snd_soc_register_card
>> --soc_init_dai_link
>> ----snd_soc_init_platform
>> --snd_soc_bind_card
>> ----snd_soc_instantiate_card
>> ------soc_check_tplg_fes
>> --------snd_soc_init_platform << ALLOC1
>> --------soc_init_dai_link
>> ----------snd_soc_init_platform << ALLOC2
>>
> Ah that explains it, in my testing I didn't have the patch that
> brought in the call from within tplg_fes
>> Initially dai_link->legacy_platform is 0, so gets set after the first
>> first devm_kzalloc (ALLOC1) and after that we always allocate new memory
>> (ALLOC2). The end result is that whatever we set in soc_check_tplg_fes
>> is lost with the new/unnecessary alloc.
>>
>> I would guess your solution is also a work-around, if devm_ effectively
>> freed the memory then the pointer would become NULL. Or may that's the
>> issue is that no one actually resets it.
>>
>>
> Yes, its a work around to fix the memory issue. If you set the
> platform in the machine driver the code will ignore it and not reset
> it. That being said that is not a full proof workaround and a better
> solution is definitely needed. We could go and clean up the pointers
> in soc_instantiate_card based on the flag being set. That way we only
> relocate on a NULL pointer like we used to but still don't affect
> statically allocated memory. I will draft a patch, test it on the AMD
> device, reply to this thread later with it, Pierre can you test it as
> well?
>
> I am curious why soc_check_tplg_fes is calling snd_soc_init_platform.
> It should have already been called earlier, in soc_init_dai_link at
> the beginning of snd_soc_register_card so the memory should already be
> initialized. Unless I am missing somewhere where links are getting
> added between the calls.
This is actually a second-order problem; the main issue I have is that
the very first call to init_dai_link fails with the new DEFER_PROBE
handling.

I don't quite understand what the Linaro/AMD folks are doing, but I trust
their changes are legitimate. To move forward, maybe it's not worth
spending too much time on a grand unification of string theory; there
are simpler solutions. The Intel machine drivers already get the
platform driver name as a platform_data argument, so we could modify
the dailinks' platform names before even registering the card. I tested
with the attached proof-of-concept patch. It adds 2 lines of code per
machine driver if we use a common helper (after the transition to the
"modern" dailink representation that's needed anyway), so maybe it's
better in the end? The override we care about is really the automatic
handling of all the hard-coded front-ends; the platform-name override
isn't really a battle I want to pick or spend time on.
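
For readers without the attachment, the shape of such a helper might look
like the sketch below. It is not the proof-of-concept patch: the toy_*
structures and function names are hypothetical, and the model runs in
userspace with only the fields needed to show the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_EINVAL 22

/* Cut-down stand-ins for a dailink table and its card. */
struct toy_dai_link {
	const char *name;
	const char *platform_name;
};

struct toy_card {
	struct toy_dai_link *links;
	size_t num_links;
};

/* Hypothetical common helper: point every link at the platform name
 * handed in by the parent driver, before the card is registered. */
static int toy_fixup_platform_name(struct toy_card *card,
				   const char *platform_name)
{
	if (!card || !platform_name)
		return -TOY_EINVAL;

	for (size_t i = 0; i < card->num_links; i++)
		card->links[i].platform_name = platform_name;

	return 0;
}

/* Models the start of a machine driver probe: the fixup is the only
 * addition, and it runs before any registration would happen. */
static int toy_machine_probe(struct toy_card *card,
			     const char *pdata_platform_name)
{
	int ret = toy_fixup_platform_name(card, pdata_platform_name);

	if (ret)
		return ret;

	/* card registration would follow here */
	return 0;
}
```

Because the names are final before registration, the core sees a
consistent dailink table on the first probe and on any deferred retry.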
On Tue, Jan 22, 2019 at 08:01:15PM -0600, Pierre-Louis Bossart wrote:
> changes are legitimate. To move forward, maybe it's not worth spending too
> much time on a grand unification of string theory, there are simpler
> solutions: the Intel machine drivers already do get the platform driver name
> as an platform_data argument, so we could modify the dailinks platform names
> before even registering the card. I tested with the attached
Yes, that would be much better - it's vastly more idiomatic. The
general idea is that a machine driver should know what it's expecting to
find before it starts probing.
>> changes are legitimate. To move forward, maybe it's not worth spending too
>> much time on a grand unification of string theory, there are simpler
>> solutions: the Intel machine drivers already do get the platform driver name
>> as an platform_data argument, so we could modify the dailinks platform names
>> before even registering the card. I tested with the attached
> Yes, that would be much better - it's vastly more idiomatic. The
> general idea is that a machine driver should know what it's expecting to
> find before it starts probing.
Thanks for the feedback, I will send a formal patch with the helper and
machine driver changes after I test more with the legacy drivers. Do you
have a preference for one patch that deals with multiple machine
drivers in one shot, or individual patches? The latter are nicer for
backports (e.g. for Chrome), the former nicer for maintainers...

The goal of reusing machine drivers as is isn't really achievable
anyway. It looks like we are going to have additional changes: e.g. if
we want to avoid the calls to snd_pcm_suspend as suggested by Takashi,
we'll have to add a reference to snd_soc_pm_ops that's only used for
SOF. The Atom/SST driver does things in different ways, mostly for
historical reasons.

-Pierre
On Thu, Jan 24, 2019 at 01:07:17PM -0600, Pierre-Louis Bossart wrote:
> Thanks for the feedback, will send a formal patch with the helper and
> machine driver changes after I test more with the legacy drivers. Do you
> have a preference for one patch that deals with multiple machines drivers in
> one shot, or individual patches? The latter are nicer for backports (e.g.
> for Chrome), the former nicer for maintainers...
More patches is good, it doesn't make a huge difference if I get one big
patch or a series of repetitive patches - big serieses are more of an
issue if they're all different patches needing individual review.
I have a patch to fix the memory leak but I haven't been able to test
it yet because I am remote right now and I accidentally bootlooped the
AMD device I am working on. I will have this tested early next week.
Here is the patch for anyone interested.
Curtis Malainey | Software Engineer | [email protected] | 650-898-3849
On Fri, Jan 25, 2019 at 3:26 AM Mark Brown <[email protected]> wrote:
>
> On Thu, Jan 24, 2019 at 01:07:17PM -0600, Pierre-Louis Bossart wrote:
>
> > Thanks for the feedback, will send a formal patch with the helper and
> > machine driver changes after I test more with the legacy drivers. Do you
> > have a preference for one patch that deals with multiple machines drivers in
> > one shot, or individual patches? The latter are nicer for backports (e.g.
> > for Chrome), the former nicer for maintainers...
>
> More patches is good, it doesn't make a huge difference if I get one big
> patch or a series of repetitive patches - big serieses are more of an
> issue if they're all different patches needing individual review.