MIME-Version: 1.0
In-Reply-To: <20150224013752.GA25909@spacedout.fries.net>
References: <CA+1k2cH5Y+RngvEbgE3CW5g+bO3V1ytDJC=u6DLVDZvbEOhu5A@mail.gmail.com>
	<CA+1k2cF1frcEUu-L_cSJxTp=GKExn6Vt2rdCeY=zrhM62FUggw@mail.gmail.com>
	<CA+1k2cEwtY+U7TS3rpmMp5nEBokO8vwhcOiD0ExVQnuoB=XVLQ@mail.gmail.com>
	<CA+1k2cE4oLTHRLkh=F5qMKqssjQg_J4-W45CBNzZ0ocG9OSoEA@mail.gmail.com>
	<20150224013752.GA25909@spacedout.fries.net>
Date: Wed, 25 Feb 2015 10:28:12 +0100
Message-ID: <CA+1k2cGrZ2AX94s4yisbLKHot8YA1DFBuSKU0PzD7TeL9U0iPA@mail.gmail.com>
Subject: Re: Fwd: w1/slaves/w1_therm: null-ptr access of sl->family_data
From: Thorsten Bschorr <thorsten@bschorr.de>
To: David Fries <David@fries.net>
Cc: linux-kernel@vger.kernel.org, Evgeniy Polyakov <zbr@ioremap.net>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5018
Lines: 124

David,

I can try to prepare a patch fixing the null-ptr access.

Thanks for the hint with the search_count, I'll try it.


2015-02-24 2:37 GMT+01:00 David Fries <David@fries.net>:
> Thorsten,
> Are you planning on submitting a patch?
>
> FYI, I load the one wire module with
> wire search_count=1
> in
> /etc/modules
> which tells it to search the bus once only.  That works for me as
> devices don't come and go on my bus, and it isn't scanning the bus
> every few seconds just to find out nothing changed.  If you are
> finding devices vanish even when they aren't maybe it will be useful
> to you as well, though the bigger problem might be why they are
> vanishing in the first place.
>
> On Mon, Feb 23, 2015 at 06:09:16PM +0100, Thorsten Bschorr wrote:
>> Hi,
>>
>> I have observed a null-pointer access in w1/slaves/w1_therm on my Raspberry
>> Pi running 3.18.5 with several DS18S20 temperature sensors. I have already
>> discussed the problem with Evgeniy.
>>
>> @Evgeniy & David: sorry for re-sending my message, my email
>> accidentally contained a HTML part.
>>
>>
>>
>> w1_therm: w1_slave_show checks if the sensor uses external power. If this
>> is the case, the mutex on dev->bus_mutex is unlocked while waiting 750 ms
>> for the sensor to convert the temperature. Before reading the temperature,
>> the mutex is again locked.
>>
>> During this sleep-time, the sensor could get detached, for example by w1.c:
>> w1_search_process_cb not finding the sensor (*):
>>  !test_bit(W1_SLAVE_ACTIVE, &sl->flags) ==true.
>> This triggers w1_slave_detach, and hence w1_therm_remove_slave frees and
>> nulls sl->family_data.
>>
>> w1_slave_show does not check if familiy_data is null after re-acquiring the
>> bus_mutex resulting in a null-ptr access when storing data.
>>
>> After I've added checks for family_data!=0, I did not observe any more
>> crashes; the other data of struct w1_slave seem to be valid as long as any
>> thread executes w1_slave_show.
>>
>>
>> I have added debug-output to w1.c and w1_therm.c, here's a log of a
>> potential crash:
>>
>> [184199.510227] w1_master_driver w1_bus_master2: w1_search_process_cb,
>> !W1_SLAVE_ACTIVE, calling w1_slave_detach
>> [184199.510276] w1_slave_driver 10-000802d9c9e4: w1_slave_detach
>> destroy_now 1
>> [184199.510297] w1_slave_driver 10-000802d9c9e4: w1_unref_slave refcnt 0
>> [184199.510321] w1_slave_driver 10-000802d9c9e4: w1_unref_slave: detaching
>> 10-000802d9c9e4 [d5fb8800].
>> [184199.510347] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
>> w1_family_notify
>> [184199.510365] w1_slave_driver 10-000802d9c9e4: w1_family_notify calling
>> remove_slave
>> [184199.510382] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
>> [184199.510400] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
>> refcnt -1
>> [184200.049745] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
>> sleep), family_data==0
>> [184200.137133] w1_slave_driver 10-000802d9c9e4: w1_slave_show (before
>> sleep), family_data==0
>> [184200.889551] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
>> sleep), family_data==0
>> [184200.930866] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
>> sleep), family_data==0
>> [184200.930907] w1_slave_driver 10-000802d9c9e4: Read failed CRC check
>> [184200.931002] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
>> device_unregister
>> [184200.931169] w1 w1_unref_slave -> kfree
>>
>> Note: When this crash happened, multiple threads were reading the sensor.
>>
>>
>> I could trigger the problem several times, and each time device_unregister
>> in w1_unref_slave was executed  *after*  w1_slave_show. In one case, the
>> logged time-difference between the first family_data==0 message and
>> device_unregister was about 8 seconds!
>> I have not observed a w1_slave_show call after w1_unref_slave (as long as
>> the device was not re-attached again).
>>
>> >From my observation, the w1_slave data seem to be valid as long as
>> w1_slave_show is executed.
>> My guess is that the call of sysfs_remove_groups in w1_family_notify hits a
>> mutex (I did not add debug output here).
>>
>>
>> (*) On my tiny raspberry, this happens from time to time with high CPU and
>> external disc load (timing and/or electrical issues); it seems that the
>> sensor does not respond in time to the (periodic) search.
>>
>>
>> Please email me if you need further information.
>>
>>
>>
>> Best regards,
>> Thorsten Bschorr
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>
> --
> David Fries <david@fries.net>    PGP pub CB1EE8F0
> http://fries.net/~david/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/