This is great news! I'm crossing my fingers and hoping that Nils can't
reproduce the crash any more with Soren's fix.
Just to let you all know, Nils has been working his ass off helping me
track his crash down. I've been pulling my hair out... I gave him patch
after patch to test various conditions & panic if the nfs_node's hash list
somehow got broken, and for the last week not a single one of those tests
detected the problem prior to the panic. The nfs_node's hash list
was being corrupted seemingly out of nowhere.
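For readers wondering what a "panic if the hash list got broken" test might look like: below is a minimal, hypothetical sketch, not the actual patch Nils tested. It is modeled on the LIST macros in BSD <sys/queue.h>, where each element's le_prev points at the previous element's le_next field (or at the list head). The names struct node, list_insert_head(), list_check() and list_check_or_die() are made up for illustration. A check like this only catches corruption that has already happened by the time it runs, which fits why none of the tests fired before the panic.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical element, modeled on a BSD LIST_ENTRY:
 * le_next points to the next element; le_prev points to the
 * previous element's le_next field (or to the head's lh_first). */
struct node {
    int key;
    struct node *le_next;
    struct node **le_prev;
};

struct list_head {
    struct node *lh_first;
};

/* Insert at the head of the list, the way LIST_INSERT_HEAD does. */
void list_insert_head(struct list_head *head, struct node *elm)
{
    elm->le_next = head->lh_first;
    if (elm->le_next != NULL)
        elm->le_next->le_prev = &elm->le_next;
    head->lh_first = elm;
    elm->le_prev = &head->lh_first;
}

/* Walk the list and verify every back-pointer.
 * Returns 1 if the list is consistent, 0 at the first broken link. */
int list_check(struct list_head *head)
{
    struct node **expected_prev = &head->lh_first;
    struct node *elm;

    for (elm = head->lh_first; elm != NULL; elm = elm->le_next) {
        if (elm->le_prev != expected_prev)
            return 0;
        expected_prev = &elm->le_next;
    }
    return 1;
}

/* Userland stand-in for a kernel panic(): report and abort. */
void list_check_or_die(struct list_head *head, const char *where)
{
    if (!list_check(head)) {
        fprintf(stderr, "hash list corrupted (%s)\n", where);
        abort();
    }
}
```

In a kernel the check would be sprinkled at points where the list is touched, with panic() in place of abort(); a stray write to any le_prev or le_next field is then caught the next time the list is walked, not at the moment of the bad write.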
For the last two days I've had Nils use hardware watchpoints in DDB to
try to track down what was modifying the memory location, with no
success. The watchpoint caught the (correct) write to the list head
but then failed to catch the corrupting write before the system
panicked, which is what makes me believe it is some sort of chipset
issue.
Another thing to note: one of the really weird things about Nils's crashes
is that the same memory location was getting corrupted every time, five
times in a row (which made it possible to use a hardware watchpoint).
The corruption changed somewhat when he added the hardware watchpoint.
Another set of crashes in the vm_page_list (which other people
report, including on a number of machines at Yahoo) has a similar M.O.:
IDE drive, medium/heavy activity. But while the corrupted address always
winds up in the (static) vm_page array, it tends to be slightly
different each time. I'm hoping that it winds up being the same or a
similar issue. I'm not ruling out the possibility that chipsets other
than the 686B have problems too.
In any case, Nils's description makes a lot of sense. I've asked him to
continue testing his system to make sure that this particular crash cannot
be reproduced, and I am crossing my fingers.
I'm also wondering how applicable this patch might be with regard to
forcing a 'safe' mode for other PCI chipsets, so that we can test
it on non-686B machines that have similar problems.
:On Thu, Dec 27, 2001 at 10:45:01AM +0100, Søren Schmidt stood up and spoke:
:> OK, here goes the VIA 686b patch. It is hand cut out of the bulk patches
:> going into 4.5, so beware :)
:Well, as Matt has said, I reported a crash that he's trying to debug. Since
:I have the 686b in my machine, I applied the patch. Ever since then I have
:not been able to reproduce the crash, although yesterday it was so easy
:that I could do it twice an hour ;-)
:Anyway, you (Soren) said that the right way to fix this is a BIOS update.
:Now, could it be that some mainboard manufacturers are incapable of
:handling this? I'm using the latest BIOS for my board, and according to
:http://www.chaintech.com.tw/DL/7xMB/7AJA0.HTM, this should already have
:been fixed in their BIOS release from 2001-04-23...
:Second interesting thing: I was using a UDMA66 drive on my 686b until a few
:weeks ago and never had any problems - the stuff Matt is looking at only
:started to appear a short while after I exchanged that drive for a UDMA100
:one. So it seems that the slower drive probably didn't produce a high
:enough PCI workload for anything to actually happen.
:This fix will probably also have some influence on a few other similar
:problems (I read that Matt has been working on many of them). In the end I hope that
:this fix - or a variation thereof - will actually go into 4.5.
To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message