iMX233-OLinuXino-MICRO stability issues (kernel oops)

Started by Kean, September 23, 2012, 02:06:28 PM

dpwhittaker

The second micro does seem to be having the same issue after more extensive testing.  I guess I just got lucky at first:


[  181.210000] Unable to handle kernel paging request at virtual address fffffaab
[  181.220000] pgd = c2cbc000
[  181.220000] [fffffaab] *pgd=43ffe831, *pte=00000000, *ppte=00000000
[  181.230000] Internal error: Oops: 17 [#1] ARM
[  181.230000] Modules linked in:
[  181.230000] CPU: 0    Not tainted  (3.6.0-rc2-09647-gddee6b1-dirty #2)
[  181.230000] PC is at fget_raw_light+0x78/0x118
[  181.230000] LR is at lockdep_init_map+0x3c/0x484
[  181.230000] pc : [<c00c6404>]    lr : [<c0056d9c>]    psr: 60000013
sp : c3a2be88  ip : 00000000  fp : 00000041
[  181.230000] r10: ffffff9c  r9 : c3a2a000  r8 : c3b04000
[  181.230000] r7 : c04dfadc  r6 : c3b4f800  r5 : 00000000  r4 : c2c8b3c0
[  181.230000] r3 : c2c8b474  r2 : c2c8b46c  r1 : c041da58  r0 : 00000000
[  181.230000] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  181.230000] Control: 0005317f  Table: 42cbc000  DAC: 00000015
[  181.230000] Process top (pid: 371, stack limit = 0xc3a2a270)
[  181.230000] Stack: (0xc3a2be88 to 0xc3a2c000)
[  181.230000] be80:                   c3a2bf78 c3a2bf00 00000001 00000000 c3a2bf78 c00d2998
[  181.230000] bea0: 00000024 3342967e 00000400 be9fef40 00000068 00000000 c3a2bf50 00000000
[  181.230000] bec0: c3828f14 c3a2a000 00000000 00000000 00000002 c3a2bf78 00000001 c3b04000
[  181.230000] bee0: ffffff9c ffffff9c c3a2a000 00000000 0001f510 c00d3094 00000041 000000d0
[  181.230000] bf00: 60000013 00000000 c3a2a000 c3828f04 00000002 00000000 c3828f04 00000000
[  181.230000] bf20: 00000000 c3828f04 c2c89040 00000000 c3828f04 00000000 00000008 00000008
[  181.230000] bf40: 0001f510 c03457b4 c3828ee0 c00de69c c3b04000 00000000 b6f5fb50 c3b04000
[  181.230000] bf60: 00000000 00000008 00000001 c00c41b4 00009bd4 ef000000 00000000 c0050000
[  181.230000] bf80: 00000024 00000100 00bb19a0 b6f5f650 00bb19a0 b6f5fb50 00000005 c000e9c8
[  181.230000] bfa0: 00000000 c000e820 b6f5f650 00bb19a0 b6f5f650 00000000 00000000 b6f5f65d
[  181.230000] bfc0: b6f5f650 00bb19a0 b6f5fb50 00000005 b6f5e8c8 00009bd4 00ba73c8 0001f510
[  181.230000] bfe0: 00000000 be9fef94 b6f4fc5c b6e5a68c 60000010 b6f5f650 43ffe831 43ffec31
[  181.230000] [<c00c6404>] (fget_raw_light+0x78/0x118) from [<be9fef40>] (0xbe9fef40)
[  181.230000] Code: e1560002 2a000017 e5933004 e7934106 (e3540000)
[  181.420000] ---[ end trace e700b39da7a4b8f3 ]---
[ 1235.480000] Unable to handle kernel paging request at virtual address bf836d40
[ 1235.480000] pgd = c2cbc000
[ 1235.480000] [bf836d40] *pgd=00000000
[ 1235.480000] Internal error: Oops: 80000005 [#2] ARM
[ 1235.480000] Modules linked in:
[ 1235.480000] CPU: 0    Tainted: G      D       (3.6.0-rc2-09647-gddee6b1-dirty #2)
[ 1235.480000] PC is at 0xbf836d40
[ 1235.480000] LR is at 0xbeaa00d0
[ 1235.480000] pc : [<bf836d40>]    lr : [<beaa00d0>]    psr: 00000013
sp : c384feb0  ip : 00000000  fp : 00000041
[ 1235.480000] r10: ffffff9c  r9 : 41000000  r8 : c39fc000
[ 1235.480000] r7 : 00000400  r6 : 3342967e  r5 : 00000024  r4 : 30610000
[ 1235.480000] r3 : 00000000  r2 : c3bfcbec  r1 : c041da58  r0 : 00000000
[ 1235.480000] Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[ 1235.480000] Control: 0005317f  Table: 42cbc000  DAC: 00000015
[ 1235.480000] Process top (pid: 372, stack limit = 0xc384e270)
[ 1235.480000] Stack: (0xc384feb0 to 0xc3850000)
[ 1235.480000] fea0:                                     00000068 00000000 c384ff50 00000000
[ 1235.480000] fec0: c38288f4 c384e000 00000000 00000000 00000002 c384ff78 00000001 c39fc000
[ 1235.480000] fee0: ffffff9c ffffff9c c384e000 00000000 0001f510 c00d3094 00000041 00000000
[ 1235.480000] ff00: 60000013 00000000 00000024 c38288e4 00000002 00000000 c38288e4 00000000
[ 1235.480000] ff20: 00000000 c38288e4 c2c89040 00000000 c38288e4 00000000 00000008 00000008
[ 1235.480000] ff40: 0001f510 c03457b4 c38288c0 c00de69c c39fc000 00000000 b6f41b50 c39fc000
[ 1235.480000] ff60: 00000000 00000008 00000001 c00c41b4 00000000 00000000 00000000 33420000
[ 1235.480000] ff80: 00000024 00000100 001e4490 b6f41650 001e4490 b6f41b50 00000005 c000e9c8
[ 1235.480000] ffa0: 00000000 c000e820 b6f41650 001e4490 b6f41650 00000000 00000000 b6f4165d
[ 1235.480000] ffc0: b6f41650 001e4490 b6f41b50 00000005 b6f408c8 00009bd4 001d93c8 0001f510
[ 1235.480000] ffe0: 00000000 bea9ff94 b6f31c5c b6e3c68c 60000010 b6f41650 55555555 55555555
[ 1235.480000] Code: bad PC value
[ 1235.480000] ---[ end trace e700b39da7a4b8f4 ]---


Whatever is happening, it seems to be pretty universal, at least on the batch I got.

Kean

Thanks to everyone who has contributed to this discussion.  I've had to concentrate the last few days on reworking our demo hardware so that we can run on the Maxi, which has proven to be very stable.  In fact I am not sure if I've ever seen an oops on the Maxi, and certainly not during heavy development on 2 boards over the last week.  I've just got another 5 Maxis to replace the Micros that I can't use.

I agree that power issues can and do cause problems, and I will try adding some additional low ESR bulk capacitance on the board tonight and leave it running.  But I'm not convinced that is going to fix the issue - there appears to be significant onboard capacitance for the included circuitry.  I've seen the problem on a board straight out of the box with no extra software or hardware attached apart from serial cable and power.

For the same reason I am not sure mmap'd GPIO makes a difference.  I am using that anyway, but as I mentioned I see this problem even without running any additional code.  I guess the demo blinking light in the Olimex rootfs image does use non-mmap'd GPIO.

In regards to microSD cards, I've seen the oops problem occur when using a microSD card from Olimex.  I've had many more serious problems when using some other cheap microSD's - causing boot problems on power up, which require ejecting and reinserting the card then a reset.  Since switching to better quality cards the boot issue has gone away.

I have built a copy of ksymoops, but I don't have the kernel symbols to make use of it.  Most of the oops that I've recorded are for an older kernel than what I am now developing with on the Maxi.  The oops seem to be totally random, and often I will just get a lock up with no oops data.
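Even without kernel symbols for ksymoops, the symbolic lines the kernel prints in each report can be compared to see whether the faults cluster anywhere. A minimal sketch, assuming the oops text has been saved to a file or is piped in from dmesg; the function name oops_summary is hypothetical and not part of any tool mentioned in this thread:

```shell
#!/bin/sh
# oops_summary (hypothetical helper): read oops text on stdin and keep only
# the lines that identify each fault: error type, PC, LR and process.
# The sed step strips an optional "[  181.230000] " timestamp prefix so that
# logs from kernels with and without printk timestamps look the same.
oops_summary() {
    sed -e 's/^\[ *[0-9.]*] //' |
        grep -E '^(Internal error:|PC is at|LR is at|Process )'
}
```

Something like "dmesg | oops_summary" on the board, or "oops_summary < saved.log" on a PC, gives a few lines per oops. For the two 3.6.0-rc2 reports posted above it would list "PC is at fget_raw_light+0x78/0x118" for the first and "PC is at 0xbf836d40" for the second, which makes it easier to see at a glance whether reports share a fault site.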

Please keep the discussion and testing going.  This forum is great.

Kean

olimex

Hi
We managed to reproduce these kernel oops issues in our lab, and now we are going to investigate the root cause.
Tsvetan

olimex

Can you check your boards, please? With a kernel generated with LTIB we can't see such problems; the problems occur only with the new kernel generated with OE.

dpwhittaker

#19
I downloaded Raivis's original LTIB package from his google drive, dd'd the kernel and unpacked the rootfs then ran top -d 0.

The first thing that I noticed was that this older version of top redraws the whole screen every time, where the newer versions only redraw the lines that change.  This means the older version spends much more time waiting on the serial port, and much less time stressing the processor and memory.

Nevertheless, after a few minutes of stopping and starting top and looking at its settings, I finally tried "top -d 0 -n 0", which seemed to do exactly what "top -d 0" does.  Either way, this run eventually failed with this Oops:


Internal error: Oops - undefined instruction: 0 [#2] PREEMPT
Modules linked in:
CPU: 0    Tainted: G      D     (2.6.31-626-g602af1c_OLinuXino #6)
PC is at 0xbee2ef9c
LR is at vsnprintf+0xc1c/0xdd8
pc : [<bee2ef9c>]    lr : [<c0141638>]    psr: 20000013
sp : c3873b70  ip : c3873ce4  fp : ffffffff
r10: c3873d30  r9 : c02c94f6  r8 : c3c62000
r7 : 00000000  r6 : 00000001  r5 : c3873d34  r4 : 00000010
r3 : 00000000  r2 : 00000001  r1 : c3c63000  r0 : c3c62000
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 0005317f  Table: 43c08000  DAC: 00000015
Process top (pid: 675, stack limit = 0xc3872270)
Stack: (0xc3873b70 to 0xc3874000)
3b60:                                     00000010 00000002 ffffffff 0000000a
3b80: ffffffff ffffffff 00000000 00001000 c3c62000 c3c63000 00000000 ffffffff
3ba0: 00000002 ffffffff c3cb30a0 c015807c c3873f88 c3873e60 00000000 c3897c60
3bc0: 00000000 c009c43c 00000001 bee2fac4 00000000 00000000 ffffffff c3873e60
3be0: 00000000 00000001 00000000 00000001 000200da c009cff0 ffffffff 00000000
3c00: c38dc6e0 00000000 00000000 00000000 00000001 00000001 ffffffff c034b38c
3c20: 1e65fb80 000200da 000200da 00000000 c034b944 c034b948 00000001 c384a340
3c40: 00000002 c0070cf8 00000002 00000041 c034b38c 00000002 c3c63000 00000000
3c60: 00000000 000200da 00000002 ffffffff 00000000 37350000 00000036 c3cb3760
3c80: 00000000 00000000 00000000 c034b38c c3872000 00000000 c3873cce c3873f23
3ca0: c3873cce c3873f23 00000086 c00281ec c3873f30 c01408f8 00000000 00000000
3cc0: 00000000 00000000 c3809ff8 c380a000 c380e000 00000010 00000002 ffffffff
3ce0: 0000000a ffffffff ffffffff c02c94f6 c003ea70 c3882860 00000000 00000000
3d00: 00000000 00000000 00000000 00000000 c3849bc0 c00a7418 c0024904 c3873d30
3d20: 00000001 00000000 c00d1318 c02c94f4 00000001 c3873ea4 00000053 00000000
3d40: 00000001 00000001 00000000 ffffffff 00400100 00000077 000023cb 0000000a
3d60: 00000010 00000000 0000005a 00000022 00000055 00000014 00000000 00000001
3d80: 0000001a 00000000 001f9000 0000007f ffffffff 00008000 00090a94 bea1ef00
3da0: bea1eae8 400ae9c8 00000000 00000000 00000000 00084a07 c003ea70 00000000
3dc0: 00000000 00000000 00000000 00000000 00000000 c0070814 00000000 00000000
3de0: 00000000 00000000 c3c44dc0 c009ed90 00000001 c3809880 c3882860 00000001
3e00: 00000014 00000000 00000001 00400100 00000000 0000005a 00000022 00000055
3e20: 0000007f 00008000 00090a94 bea1ef00 00000001 00000000 00000000 001f9000
3e40: 400ae9c8 bea1eae8 c003ea70 ffffffff 00000000 00000053 00000000 00000001
3e60: 00000001 00000001 000023cb 00000010 00000022 00000055 00000000 ffffffff
3e80: 0000001a 00000000 0000000a 00000077 00000000 0000005a cd0a3c00 00000000
3ea0: c034b38c 74696e69 00000000 00000000 00000000 00084a07 00000000 00000000
3ec0: 00000000 a0000013 00000000 c380a000 fffffffd c3882860 c3fb4ca8 000003ff
3ee0: c3873f80 00000000 c3882600 c00d135c 00000001 c032ec44 c3809880 c00ce48c
3f00: c3882600 c3882860 bee2f5d8 00000001 00000000 c00a79d8 000003c3 bee2f5d8
3f20: 00000000 c3882888 bee2f9d8 c034b38c 00000000 00000000 c38011a0 c3882600
3f40: bee2f5d8 c3873f80 00000000 000003ff c3872000 bee2fa49 00000005 c008e098
3f60: 00000000 c3882600 c3882600 fffffff7 00000000 00000000 c0023f64 c008e484
3f80: 00000000 00000000 00000000 00000000 bee2f5d8 ffffffff 00000004 bee2f5d8
3fa0: 00000003 c0023de0 ffffffff 00000004 00000004 bee2f5d8 000003ff 00000000
3fc0: ffffffff 00000004 bee2f5d8 00000003 00003633 00003633 bee2fa49 00000005
3fe0: 00000000 bee2f5c8 00075538 400d17ac 60000010 00000004 aaaaaaaa aaaaaa8a
Code: 00000000 00000000 00000000 00000000 (ffffffff)
---[ end trace 6587df8e926aeb33 ]---


It also managed to turn off echo on the serial port, but I was able to type dmesg blindly to pull up a clean copy of the Oops.

I found a simpler way to reproduce on that kernel as well:

stty rows 5
top -d 0

And got another Oops:


Unable to handle kernel paging request at virtual address bebeb274
pgd = c38e8000
[bebeb274] *pgd=43c2b031, *pte=00000000, *ppte=00000000
Internal error: Oops: 0 [#1] PREEMPT
Modules linked in:
CPU: 0    Not tainted  (2.6.31-626-g602af1c_OLinuXino #6)
PC is at 0xbebeb274
LR is at vsnprintf+0xc1c/0xdd8
pc : [<bebeb274>]    lr : [<c0141638>]    psr: 20000013
sp : c3843b70  ip : c3843ce4  fp : ffffffff
r10: c3843d30  r9 : c02c94f6  r8 : c3c2e000
r7 : 00000000  r6 : 00000001  r5 : c3843d34  r4 : 00000010
r3 : 00000000  r2 : 00000001  r1 : c3c2f000  r0 : c3c2e000
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 0005317f  Table: 438e8000  DAC: 00000015
Process top (pid: 660, stack limit = 0xc3842270)
Stack: (0xc3843b70 to 0xc3844000)
3b60:                                     00000010 00000002 ffffffff 0000000a
3b80: ffffffff ffffffff 00000000 00001000 c3c2e000 c3c2f000 00000000 ffffffff
3ba0: 00000002 ffffffff c3c7f260 c015807c c3843f88 c3843e60 00000000 c3c6e360
3bc0: 00000000 c009c43c 00000001 beb1dac4 00000000 00000000 ffffffff c3843e60
3be0: 00000000 00000001 00000000 00000001 000200da c009cff0 ffffffff 00000000
3c00: c3c90520 00000000 00000000 00000000 00000001 00000001 ffffffff c034b38c
3c20: 23c34600 000200da 000200da 00000000 c034b944 c034b948 00000001 c38f41c0
3c40: 00000002 c0070cf8 00000002 00000041 c034b38c 00000002 c3c2f000 00000000
3c60: 00000000 000200da 00000002 ffffffff 232b29f2 3630005f 00000036 c0351080
3c80: 00000103 00000081 00000001 c3c7f660 00000000 00000000 00000000 00000001
3ca0: c3842000 c380a000 00000086 c00281ec c3842000 c032f3cc c3c7f660 00000000
3cc0: 00000001 00000000 c3809ff8 c380a000 c380e000 00000010 00000002 ffffffff
3ce0: 0000000a ffffffff ffffffff c02c94f6 c003ea70 c3c42360 00000000 00000000
3d00: 00000000 00000000 00000000 00000000 c3c84780 c00a7418 c0024904 c3843d30
3d20: 00000001 00000000 c00d1318 c02c94f4 00000001 c3843ea4 00000053 00000000
3d40: 00000001 00000001 00000000 ffffffff 00400100 00000077 0000246f 0000000a
3d60: 00000010 00000000 0000005a 00000021 00000061 00000014 00000000 00000001
3d80: 0000001a 00000000 001f9000 0000007f ffffffff 00008000 00090a94 bed8df00
3da0: bed8dae8 400ae9c8 00000000 00000000 00000000 00084a07 c003ea70 00000000
3dc0: 00000000 00000000 00000000 00000000 00000000 c0070814 00000000 00000000
3de0: 00000000 00000000 c3c25608 c009ed90 00000001 c3809880 c3c42360 00000001
3e00: 00000014 00000000 00000001 00400100 00000000 0000005a 00000021 00000061
3e20: 0000007f 00008000 00090a94 bed8df00 00000001 00000000 00000000 001f9000
3e40: 400ae9c8 bed8dae8 c003ea70 ffffffff 00000000 00000053 00000000 00000001
3e60: 00000001 00000001 0000246f 00000010 00000021 00000061 00000000 ffffffff
3e80: 0000001a 00000000 0000000a 00000077 00000000 0000005a cda2d280 00000000
3ea0: c034b38c 74696e69 00000000 00000000 00000000 00084a07 00000000 00000000
3ec0: 00000000 a0000013 00000000 c380a000 fffffffd c3c42360 c38ecb08 000003ff
3ee0: c3843f80 00000000 c3c42280 c00d135c 00000001 c032ec44 c3809880 c00ce48c
3f00: c3c42280 c3c42360 beb1d5d8 00000001 00000000 c00a79d8 000003c2 beb1d5d8
3f20: 00000000 c3c42388 beb1d9d8 c034b38c 00000000 00000000 c38011a0 c3c42280
3f40: beb1d5d8 c3843f80 00000000 000003ff c3842000 beb1da49 00000003 c008e098
3f60: 00000000 c3c42280 c3c42280 fffffff7 00000000 00000000 c0023f64 c008e484
3f80: 00000000 00000000 00000000 00000000 beb1d5d8 ffffffff 00000004 beb1d5d8
3fa0: 00000003 c0023de0 ffffffff 00000004 00000004 beb1d5d8 000003ff 00000000
3fc0: ffffffff 00000004 beb1d5d8 00000003 00003633 00003633 beb1da49 00000003
3fe0: 00000000 beb1d5c8 00075538 400d17ac 60000010 00000004 c0248bc4 c0248bd4
Code: bad PC value.
---[ end trace 3788ceb25656dbec ]---


The idea behind that method is to limit the amount of rows top can show so it spends more time stressing the cpu and/or memory and less time shoving bytes down the DUART.  Give that a try and see if you can reproduce on LTIB.
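A minimal sketch of that method as a script, for anyone who wants comparable numbers. The function names are hypothetical, and it assumes /proc/uptime and dmesg are available on the image and that an oops in top kills the process rather than hanging the board:

```shell
#!/bin/sh
# Count kernel oops reports in log text read from stdin.
count_oops() {
    grep -c 'Internal error: Oops' || true   # grep exits 1 on zero matches
}

# Run the reduced-output stress test and report how long top survived.
# Hypothetical helper; call it by hand from the serial console.
repro() {
    stty rows 5                      # fewer rows, less time spent on the DUART
    read -r start _ < /proc/uptime
    top -d 0                         # returns when top quits or is killed
    read -r end _ < /proc/uptime
    echo "top ran from uptime ${start}s to ${end}s"
    echo "oops reports in dmesg: $(dmesg | count_oops)"
}
```

Calling "repro" at the prompt starts the test. If the board locks up completely instead of printing an oops, nothing after top runs, so this only helps for the cases where the kernel recovers.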

EDIT: With all that said, I think it goes without saying that "use LTIB", a distro that you have considered obsolete from day one, is not a solution.  Good luck on the lab testing - I hope you find a simple solution.

olimex

Perhaps you did not understand the meaning of my previous post. I am not saying "use LTIB"; we have to identify where the problem comes from. Here in the lab, a MICRO which had kernel oops with the ARCH image runs mplayer stably for hours with the LTIB kernel, so we were wondering whether the problem is somehow related to the kernel image. We will write the ARCH image to the SD card which now has LTIB, so we can be sure this is not related to the SD card media either.

dpwhittaker

You are right, I did misinterpret your message.  Being a software guy, I'm used to people saying "it works fine on my software, so it must be something wrong with YOUR software".  Please forgive me for jumping to that conclusion here.

I also realize I am probably coming across as saying "it is broken on all my software, so it must be something wrong with YOUR hardware".  Please don't take it that way either - there is a lot of shared code between all the kernels, so there is still room for a software issue.  I'm kind of hoping for a software issue that I can just apply a patch and continue on with the hardware I have, but I can't see any common threads that point to a specific software issue, so I'm worried it could be something more fundamental.

I have now seen this issue on LTIB, OE, and linux-mainline.  LTIB does seem to take longer to show an issue, but it also seems to be all-around slower... I can't really put my finger on it, but perhaps it is just stressing the hardware less somehow, or perhaps its default configuration does not include some of the modules that are causing the issue.  Maybe busybox just isn't hitting the same code paths that standard linux packages are hitting.  There are so many variables in a software package the size of linux.

Have you tried taking the arch image which easily showed the Oops and plugging that same SD card into a Maxi?  If it gives no errors for a reasonable amount of time, then that basically narrows the problem to the differences between the Maxi and the Micro:

Board layout - trace lengths, capacitance, EMI between the CPU and memory so close together, dunno what else
The USB Hub/Ethernet chip - maybe something in the bootloader or kernel is attempting to initialize or interact with it and causing errors when it doesn't respond
The USB/Ethernet/Audio headers - it's a difference, don't know how leaving these unplugged would cause a problem
What other differences are there between the maxi and micro?

Any other suggestions on things to try?  I'll help wherever I can, especially on software-related things, although it will probably be tomorrow evening (GMT-6) before I get any more free time.

Kean

@olimex @dpwhittaker

I've been seeing the oops on the Micro using ARCH - so it isn't restricted to OE.  I've not seen a problem on the Maxi using the same image, or even the same SD card, so I think it is related to a hardware difference.  Other than the EMI/DDR traces, maybe a non-terminated/floating input?

The oops (or lockup) will occur even if you don't try to "stress" the system (e.g. running top), but it can take maybe 2 or 3 hours.

It is a pity the Micro has this problem, but we love the Maxi!

Kean

LubOlimex

#23
Guys,

So far we've been able to increase the stability by decreasing the EMI speed to 96MHz down from 133MHz, using the old fsl image. We are still testing the stability but decreasing the EMI speed actually decreased the load on the chip by 50% (while doing top -d 0).

We will continue testing.

Edit: Thanks for the correction, it's EMI speed of course
Technical support and documentation manager at Olimex

dpwhittaker

#24
I believe you mean the EMI (dram) clock, not the CPU clock.  The CPU clock is set at 454MHz by default (and seems to be hardcoded with no other easily configurable option, at least on Freescale's bootlets bootloader).  On the other hand, the EMI clock can easily be configured to select either 96MHz or 133MHz with a simple define.

While I can't find my way around the fsl toolchain to save my life, I have figured out how to implement this change on koliqi's linux-mainline toolchain.  Simply go into his boot/imx-bootlets-src-10.05.02/boot_prep/init-mx23.c file, and uncomment line 34 (#define EMI_96M).  Switch back to the imx-bootlets-src-10.05.02 directory and run:

make CROSS_COMPILE=arm-linux-gnueabi- clean (or arm-none-eabi- if you are on ubuntu and went that route)
make CROSS_COMPILE=arm-linux-gnueabi-

dd if=sd_mmc_bootstream.raw of=/dev/sdX1  (where X is your sd card's letter)

Put your sd card back in the micro and reboot.
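The edit and rebuild can be sketched as a small script. This is only a sketch under assumptions: it assumes the define is commented out with "//" (if the file uses a /* */ comment, edit it by hand instead), that sed supports -i (GNU or busybox sed), and the function names are hypothetical. The dd step is left out on purpose so the target device is always typed by hand.

```shell
#!/bin/sh
# Uncomment "//#define EMI_96M" in the given file (assumed comment style).
# Returns non-zero if the define is still not active afterwards.
enable_emi_96m() {
    sed -i 's|^[[:space:]]*//[[:space:]]*\(#define[[:space:]][[:space:]]*EMI_96M\)|\1|' "$1"
    grep -q '^#define[[:space:]][[:space:]]*EMI_96M' "$1"
}

# Rebuild the bootlets with the 96MHz EMI setting.
# $1: path to imx-bootlets-src-10.05.02
# $2: cross-compiler prefix, e.g. arm-linux-gnueabi-
rebuild_bootlets() {
    enable_emi_96m "$1/boot_prep/init-mx23.c" || return 1
    make -C "$1" CROSS_COMPILE="$2" clean &&
        make -C "$1" CROSS_COMPILE="$2"
}
```

For example, "rebuild_bootlets boot/imx-bootlets-src-10.05.02 arm-linux-gnueabi-" from the top of the toolchain tree, followed by the dd command above. The grep check is there so a silent no-op (wrong comment style, wrong file) does not end with a 133MHz bootstream being flashed by mistake.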

I've been running top -d 0 for over 30 minutes while writing this post, and have not had any failures yet, so, if indeed this is the equivalent action to what you've done only on linux-mainline, then I can corroborate your testing and say that this workaround does indeed seem to increase stability.

I don't necessarily like that I have to run my memory at roughly 72% speed (96MHz instead of 133MHz) to keep everything stable, but I suppose there is a price to pay for getting linux into this size and price range.  Still, I'm holding out hope that you can find a solution that allows the micro to run at full speed.

EDIT: 1:24 and top is still going strong.  Will leave it running overnight and see if or when it fails in the morning, and again tomorrow evening (GMT -6) if it is still going in the morning.

EDIT 2: 8 Hours and still going strong at 96MHz.

Kean

Looking forward to hearing more reports of testing with the slower EMI DRAM speed.  If someone can make an image available with that, I can do some testing here - I don't have time to rebuild the kernel at the moment.

FWIW, I added a 100uF tantalum cap to the 5V input near the power connector (ESR of 85mohm).  I also changed the power connections from the lab supply from 24AWG to 16AWG.  I left top running so I could see the uptime when it failed.

It ran for 9 hrs 28 mins before oops.  Not sure if that is a useful datapoint - only one sample.
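Since many failures are hard lockups with nothing on the console, one way to collect more than a single sample is to record the uptime to the SD card periodically, so the last value survives a reset. A rough sketch, assuming /proc/uptime is available; the function names and the log path in the usage note are made up for illustration:

```shell
#!/bin/sh
# Format a number of seconds the same way as the figure above.
fmt_uptime() {
    echo "$(( $1 / 3600 )) hrs $(( ($1 % 3600) / 60 )) mins"
}

# Overwrite the log file with the current uptime every $2 seconds.
# $1: log file on persistent storage, $2: interval in seconds
log_uptime() {
    while :; do
        read -r up _ < /proc/uptime      # e.g. "34080.12 33011.90"
        fmt_uptime "${up%%.*}" > "$1"    # drop the fractional part
        sync                             # push the write out to the card
        sleep "$2"
    done
}
```

Started as "log_uptime /root/last_uptime.txt 60 &" before the stress test, the file read after the next boot shows how long the previous run lasted, to within the chosen interval. A longer interval keeps the extra SD card writes down.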

Kean

LubOlimex

I will be testing stability with different capacitors today. As a start will replace some of the 100nF ones with 220nF and see how it goes with different images.

Lub/OLIMEX

davidjf2001

I still find it strange that memory issues can be the direct cause of this.  Memory issues like this typically cause unrecoverable failures.  Maybe crosstalk from the memory bus is influencing other signals to cause the errors that the CPU can recover from.  I would leave the memory at high speed to increase the rate of the errors and do my best at tracing through what the messages indicate.  Also rebuilding the kernel to remove all modules not absolutely necessary may shed some light.

LubOlimex

A little update: so far increasing the capacitors around the memory didn't improve the stability (the problem still shows up after ~30 min of top -d 0), but removing R17 seems to improve it. So far it has been running for 4 hours without a problem. Will continue testing tomorrow.

Lub/OLIMEX

dpwhittaker

#29
Alright, I've got some more free time this weekend, and have logged over 36 hours of uptime with top -d 0 running with the memory at 96MHz.  So if nothing else, this seems to solve the problem.  I'll also try popping off R17 and running at full speed.  Hopefully that will prove to be the simple modification that fixes the problem permanently.

EDIT: The hardware mod was easier than I imagined.  R17 is at the end of a row of resistors and capacitors, and even despite its tiny size, was relatively simple to remove with a fine-tip soldering iron.  It may be a little more difficult to put back if these tests fail, but I think I can handle it.

So, I'll work with it today without the resistor.  I'm writing an LRADC driver (from scratch, not based off the IIO version from Marek - I wanted one that would support continuous readings, delay channels, and oversampling, and I don't have time to learn the IIO subsystem, so I'm writing a generic character device driver - /dev/lradc0-8).  When I finish for the day, I'll kick off top and let it continue to stress test the system.  This should give us a mix of real-world and artificial tests to ensure that this fix really does solve the problem before everyone else goes off and mods their board.

top -d 0 has been running continuously for 10 minutes while writing this post, so it looks like we are on the right track.

EDIT 2: I was able to "make scripts" in my kernel source folder without error for the first time with the resistor removed.  I was also able to build two kernel modules onboard.  I had to enable swap to get the module to build (don't worry, I'm using a million write cycle sandisk SD), so I know I'm stressing the memory.  More testing to come, but I think this does the trick.

EDIT 3: After a full day of development with the resistor removed, there were no unexplained kernel oops (several were caused by bugs in my module, but all happened as a direct result of reading or writing to the character device connected to my driver, so they were obviously caused by bugs in my software and not memory timing issues).  Today's test included many iterations of editing, compiling, inserting, and testing my kernel module, so there was a good bit of real-world stress put on the micro, and it worked like a champ.  I'll run top -d 0 overnight to get a good artificial test as well, but I'd be willing to bet it will still be going strong in the morning.