2015-02-23 17:09:20

by Thorsten.Bschorr

[permalink] [raw]
Subject: Fwd: w1/slaves/w1_therm: null-ptr access of sl->family_data

Hi,

I have observed a null-pointer access in w1/slaves/w1_therm on my Raspberry
Pi running 3.18.5 with several DS18S20 temperature sensors. I have already
discussed the problem with Evgeniy.

@Evgeniy & David: sorry for re-sending my message, my email
accidentally contained a HTML part.



w1_therm: w1_slave_show checks if the sensor uses external power. If this
is the case, the mutex on dev->bus_mutex is unlocked while waiting 750 ms
for the sensor to convert the temperature. Before reading the temperature,
the mutex is again locked.

During this sleep-time, the sensor could get detached, for example by w1.c:
w1_search_process_cb not finding the sensor (*):
!test_bit(W1_SLAVE_ACTIVE, &sl->flags) ==true.
This triggers w1_slave_detach, and hence w1_therm_remove_slave frees and
nulls sl->family_data.

w1_slave_show does not check if familiy_data is null after re-acquiring the
bus_mutex resulting in a null-ptr access when storing data.

After I've added checks for family_data!=0, I did not observe any more
crashes; the other data of struct w1_slave seem to be valid as long as any
thread executes w1_slave_show.


I have added debug-output to w1.c and w1_therm.c, here's a log of a
potential crash:

[184199.510227] w1_master_driver w1_bus_master2: w1_search_process_cb,
!W1_SLAVE_ACTIVE, calling w1_slave_detach
[184199.510276] w1_slave_driver 10-000802d9c9e4: w1_slave_detach
destroy_now 1
[184199.510297] w1_slave_driver 10-000802d9c9e4: w1_unref_slave refcnt 0
[184199.510321] w1_slave_driver 10-000802d9c9e4: w1_unref_slave: detaching
10-000802d9c9e4 [d5fb8800].
[184199.510347] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
w1_family_notify
[184199.510365] w1_slave_driver 10-000802d9c9e4: w1_family_notify calling
remove_slave
[184199.510382] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
[184199.510400] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
refcnt -1
[184200.049745] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
sleep), family_data==0
[184200.137133] w1_slave_driver 10-000802d9c9e4: w1_slave_show (before
sleep), family_data==0
[184200.889551] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
sleep), family_data==0
[184200.930866] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
sleep), family_data==0
[184200.930907] w1_slave_driver 10-000802d9c9e4: Read failed CRC check
[184200.931002] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
device_unregister
[184200.931169] w1 w1_unref_slave -> kfree

Note: When this crash happened, multiple threads were reading the sensor.


I could trigger the problem several times, and each time device_unregister
in w1_unref_slave was executed *after* w1_slave_show. In one case, the
logged time-difference between the first family_data==0 message and
device_unregister was about 8 seconds!
I have not observed a w1_slave_show call after w1_unref_slave (as long as
the device was not re-attached again).

>From my observation, the w1_slave data seem to be valid as long as
w1_slave_show is executed.
My guess is that the call of sysfs_remove_groups in w1_family_notify hits a
mutex (I did not add debug output here).


(*) On my tiny raspberry, this happens from time to time with high CPU and
external disc load (timing and/or electrical issues); it seems that the
sensor does not respond in time to the (periodic) search.


Please email me if you need further information.



Best regards,
Thorsten Bschorr


2015-02-24 01:51:41

by David Fries

[permalink] [raw]
Subject: Re: Fwd: w1/slaves/w1_therm: null-ptr access of sl->family_data

Thorsten,
Are you planning on submitting a patch?

FYI, I load the one wire module with
wire search_count=1
in
/etc/modules
which tells it to search the bus once only. That works for me as
devices don't come and go on my bus, and it isn't scanning the bus
every few seconds just to find out nothing changed. If you are
finding devices vanish even when they aren't maybe it will be useful
to you as well, though the bigger problem might be why they are
vanishing in the first place.

On Mon, Feb 23, 2015 at 06:09:16PM +0100, Thorsten Bschorr wrote:
> Hi,
>
> I have observed a null-pointer access in w1/slaves/w1_therm on my Raspberry
> Pi running 3.18.5 with several DS18S20 temperature sensors. I have already
> discussed the problem with Evgeniy.
>
> @Evgeniy & David: sorry for re-sending my message, my email
> accidentally contained a HTML part.
>
>
>
> w1_therm: w1_slave_show checks if the sensor uses external power. If this
> is the case, the mutex on dev->bus_mutex is unlocked while waiting 750 ms
> for the sensor to convert the temperature. Before reading the temperature,
> the mutex is again locked.
>
> During this sleep-time, the sensor could get detached, for example by w1.c:
> w1_search_process_cb not finding the sensor (*):
> !test_bit(W1_SLAVE_ACTIVE, &sl->flags) ==true.
> This triggers w1_slave_detach, and hence w1_therm_remove_slave frees and
> nulls sl->family_data.
>
> w1_slave_show does not check if familiy_data is null after re-acquiring the
> bus_mutex resulting in a null-ptr access when storing data.
>
> After I've added checks for family_data!=0, I did not observe any more
> crashes; the other data of struct w1_slave seem to be valid as long as any
> thread executes w1_slave_show.
>
>
> I have added debug-output to w1.c and w1_therm.c, here's a log of a
> potential crash:
>
> [184199.510227] w1_master_driver w1_bus_master2: w1_search_process_cb,
> !W1_SLAVE_ACTIVE, calling w1_slave_detach
> [184199.510276] w1_slave_driver 10-000802d9c9e4: w1_slave_detach
> destroy_now 1
> [184199.510297] w1_slave_driver 10-000802d9c9e4: w1_unref_slave refcnt 0
> [184199.510321] w1_slave_driver 10-000802d9c9e4: w1_unref_slave: detaching
> 10-000802d9c9e4 [d5fb8800].
> [184199.510347] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
> w1_family_notify
> [184199.510365] w1_slave_driver 10-000802d9c9e4: w1_family_notify calling
> remove_slave
> [184199.510382] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
> [184199.510400] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
> refcnt -1
> [184200.049745] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
> sleep), family_data==0
> [184200.137133] w1_slave_driver 10-000802d9c9e4: w1_slave_show (before
> sleep), family_data==0
> [184200.889551] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
> sleep), family_data==0
> [184200.930866] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
> sleep), family_data==0
> [184200.930907] w1_slave_driver 10-000802d9c9e4: Read failed CRC check
> [184200.931002] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
> device_unregister
> [184200.931169] w1 w1_unref_slave -> kfree
>
> Note: When this crash happened, multiple threads were reading the sensor.
>
>
> I could trigger the problem several times, and each time device_unregister
> in w1_unref_slave was executed *after* w1_slave_show. In one case, the
> logged time-difference between the first family_data==0 message and
> device_unregister was about 8 seconds!
> I have not observed a w1_slave_show call after w1_unref_slave (as long as
> the device was not re-attached again).
>
> >From my observation, the w1_slave data seem to be valid as long as
> w1_slave_show is executed.
> My guess is that the call of sysfs_remove_groups in w1_family_notify hits a
> mutex (I did not add debug output here).
>
>
> (*) On my tiny raspberry, this happens from time to time with high CPU and
> external disc load (timing and/or electrical issues); it seems that the
> sensor does not respond in time to the (periodic) search.
>
>
> Please email me if you need further information.
>
>
>
> Best regards,
> Thorsten Bschorr
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
David Fries <[email protected]> PGP pub CB1EE8F0
http://fries.net/~david/

2015-02-25 09:28:16

by Thorsten.Bschorr

[permalink] [raw]
Subject: Re: Fwd: w1/slaves/w1_therm: null-ptr access of sl->family_data

David,

I can try to prepare a patch fixing the null-ptr access.

Thanks for the hint with the search_count, I'll try it.


2015-02-24 2:37 GMT+01:00 David Fries <[email protected]>:
> Thorsten,
> Are you planning on submitting a patch?
>
> FYI, I load the one wire module with
> wire search_count=1
> in
> /etc/modules
> which tells it to search the bus once only. That works for me as
> devices don't come and go on my bus, and it isn't scanning the bus
> every few seconds just to find out nothing changed. If you are
> finding devices vanish even when they aren't maybe it will be useful
> to you as well, though the bigger problem might be why they are
> vanishing in the first place.
>
> On Mon, Feb 23, 2015 at 06:09:16PM +0100, Thorsten Bschorr wrote:
>> Hi,
>>
>> I have observed a null-pointer access in w1/slaves/w1_therm on my Raspberry
>> Pi running 3.18.5 with several DS18S20 temperature sensors. I have already
>> discussed the problem with Evgeniy.
>>
>> @Evgeniy & David: sorry for re-sending my message, my email
>> accidentally contained a HTML part.
>>
>>
>>
>> w1_therm: w1_slave_show checks if the sensor uses external power. If this
>> is the case, the mutex on dev->bus_mutex is unlocked while waiting 750 ms
>> for the sensor to convert the temperature. Before reading the temperature,
>> the mutex is again locked.
>>
>> During this sleep-time, the sensor could get detached, for example by w1.c:
>> w1_search_process_cb not finding the sensor (*):
>> !test_bit(W1_SLAVE_ACTIVE, &sl->flags) ==true.
>> This triggers w1_slave_detach, and hence w1_therm_remove_slave frees and
>> nulls sl->family_data.
>>
>> w1_slave_show does not check if familiy_data is null after re-acquiring the
>> bus_mutex resulting in a null-ptr access when storing data.
>>
>> After I've added checks for family_data!=0, I did not observe any more
>> crashes; the other data of struct w1_slave seem to be valid as long as any
>> thread executes w1_slave_show.
>>
>>
>> I have added debug-output to w1.c and w1_therm.c, here's a log of a
>> potential crash:
>>
>> [184199.510227] w1_master_driver w1_bus_master2: w1_search_process_cb,
>> !W1_SLAVE_ACTIVE, calling w1_slave_detach
>> [184199.510276] w1_slave_driver 10-000802d9c9e4: w1_slave_detach
>> destroy_now 1
>> [184199.510297] w1_slave_driver 10-000802d9c9e4: w1_unref_slave refcnt 0
>> [184199.510321] w1_slave_driver 10-000802d9c9e4: w1_unref_slave: detaching
>> 10-000802d9c9e4 [d5fb8800].
>> [184199.510347] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
>> w1_family_notify
>> [184199.510365] w1_slave_driver 10-000802d9c9e4: w1_family_notify calling
>> remove_slave
>> [184199.510382] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
>> [184199.510400] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
>> refcnt -1
>> [184200.049745] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
>> sleep), family_data==0
>> [184200.137133] w1_slave_driver 10-000802d9c9e4: w1_slave_show (before
>> sleep), family_data==0
>> [184200.889551] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
>> sleep), family_data==0
>> [184200.930866] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
>> sleep), family_data==0
>> [184200.930907] w1_slave_driver 10-000802d9c9e4: Read failed CRC check
>> [184200.931002] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
>> device_unregister
>> [184200.931169] w1 w1_unref_slave -> kfree
>>
>> Note: When this crash happened, multiple threads were reading the sensor.
>>
>>
>> I could trigger the problem several times, and each time device_unregister
>> in w1_unref_slave was executed *after* w1_slave_show. In one case, the
>> logged time-difference between the first family_data==0 message and
>> device_unregister was about 8 seconds!
>> I have not observed a w1_slave_show call after w1_unref_slave (as long as
>> the device was not re-attached again).
>>
>> >From my observation, the w1_slave data seem to be valid as long as
>> w1_slave_show is executed.
>> My guess is that the call of sysfs_remove_groups in w1_family_notify hits a
>> mutex (I did not add debug output here).
>>
>>
>> (*) On my tiny raspberry, this happens from time to time with high CPU and
>> external disc load (timing and/or electrical issues); it seems that the
>> sensor does not respond in time to the (periodic) search.
>>
>>
>> Please email me if you need further information.
>>
>>
>>
>> Best regards,
>> Thorsten Bschorr
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>
> --
> David Fries <[email protected]> PGP pub CB1EE8F0
> http://fries.net/~david/