A20 Lime CAN: controller-problem{rx-overflow}

Started by bart, October 26, 2016, 02:32:20 PM

bart

Hi everyone,

After some trial and error, I got the sun7i_can kernel module working on an A20 Lime1 board. I am using the sunxi kernel, version 3.4.104+.

But the driver seems very flaky. There are two issues:

1. I connect everything correctly to a CAN bus where another device is sending data. I run the following commands:
ip link set can0 down
ip link set can0 type can bitrate 500000 loopback off
ip link set can0 up
candump -cae can0,0:0,#FFFFFFFF

Nothing happens, until I run
cansend can0 5A1#A5 (just a dummy value)
in another terminal. Then the data starts pouring out.

2. This is the more serious issue:
After I start receiving data as described above, the connection works for a while. Then there are more and more errors like this:
can0  20000004   [8]  00 01 00 00 00 00 00 00   ERRORFRAME
        controller-problem{rx-overflow}
This seems to be the same problem described here:
https://sourceforge.net/p/can4linux/discussion/1013310/thread/2eeb9098/#de53

In particular, the issue occurs more often if I dump the output to the console instead of a file, but it occurs in both cases. Once the issue arises, no data is received anymore, only overflow errors.
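
For reference, the error frame shown above can be decoded by hand from the SocketCAN error-frame layout. The following is a minimal Python sketch, not part of can-utils or the driver: the bit values are taken from the Linux headers linux/can.h and linux/can/error.h, the text labels are modeled on what candump prints, and the function name decode_error_frame is made up for illustration.

```python
# Bit values as defined in <linux/can.h> and <linux/can/error.h>.
CAN_ERR_FLAG = 0x20000000  # set in can_id when the frame is an error frame
CAN_ERR_MASK = 0x1FFFFFFF  # the remaining bits carry the error class

# Error classes encoded in can_id (labels modeled on candump's output).
ERROR_CLASSES = {
    0x001: "tx-timeout",
    0x002: "lost-arbitration",
    0x004: "controller-problem",
    0x008: "protocol-violation",
    0x010: "transceiver-status",
    0x020: "no-acknowledgement-on-tx",
    0x040: "bus-off",
    0x080: "bus-error",
    0x100: "restarted-after-bus-off",
}

# Controller details in data[1], meaningful when class 0x004 is set.
CONTROLLER_DETAILS = {
    0x01: "rx-overflow",
    0x02: "tx-overflow",
    0x04: "rx-error-warning",
    0x08: "tx-error-warning",
    0x10: "rx-error-passive",
    0x20: "tx-error-passive",
    0x40: "back-to-error-active",
}


def decode_error_frame(can_id, data):
    """Return a list of readable error descriptions, or [] for a normal frame."""
    if not can_id & CAN_ERR_FLAG:
        return []
    error_bits = can_id & CAN_ERR_MASK
    descriptions = []
    for bit, name in ERROR_CLASSES.items():
        if not error_bits & bit:
            continue
        if bit == 0x004 and len(data) > 1:
            details = [d for b, d in CONTROLLER_DETAILS.items() if data[1] & b]
            name += "{" + ",".join(details) + "}"
        descriptions.append(name)
    return descriptions


# The frame from the candump output above: ID 20000004, data 00 01 00 ...
print(decode_error_frame(0x20000004, bytes.fromhex("0001000000000000")))
```

Read this way, ID 20000004 is the error flag plus the controller-problem class, and the 01 in the second data byte is the RX overflow bit.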

500 kbit/s does not seem like a lot of data for a 1 GHz CPU, so how can I avoid these overflow issues?


Thank you very much,
Bart

KeesZagers

1. Everything connected correctly? How many nodes are on the bus? Are all CAN Hi's connected together? The same for the CAN Lo's and the CAN Gnd's? Are both ends of the bus terminated by 120 ohm resistors?

2. The bitrate of 500 kbit/sec is not relevant in this case. The CAN controller is handling these bits. What is relevant is how many messages are sent on the bus per second. The CPU handles a message only once it has been received completely and correctly. It is always nice to test a new node if you already have a working network. If you have only one other node, it cannot send its message until your test node is connected. It will keep retrying until it receives an ACK from your node. In the document you referred to, the user sent 2000 messages per second. This is quite a high busload. Are you also testing with so many messages per second? In that case I can imagine that you get overflow errors. If the FIFO is full and the messages keep coming at this speed, it will never get the chance to empty again, so I can imagine that only overflow errors keep coming.

3. I think that one of the developers of CAN4LINUX (Heinz) is also watching this forum. Maybe he has more information.

bart

Thank you very much for your reply, I appreciate the input.

1. There are a handful of nodes on the bus, no more than 10 though. The physical connection should be ok, since I can receive messages, shouldn't it?

2. I would assume the bitrate is relevant, since it limits the message rate. If 2000 messages per second can be sent without saturating the bus, then the CAN receiver should be able to process them, shouldn't it? In either case, the target application is connecting to a machine to read telemetry data and we have no influence on the message rate.

3. I would very much appreciate any help.

KeesZagers

1. OK

2. What I meant to say is: you can have a 500 kbit/sec network with only one message per second. The driver should not have any problem with that, because it will be activated only once per second. The CAN controller hardware will receive a burst of about 100 bits during the message, and after about 200 µs it will make the message available in a FIFO. In this case the driver has 1 second to read it.

3. Unfortunately I don't know the details of the CAN4Linux driver, but I will kick Heinz through my hotline :)
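
To put rough numbers on point 2 and on the earlier question about the message-rate limit: the sketch below estimates the frame length and the resulting ceiling on frames per second. It assumes classic CAN with standard 11-bit identifiers, counts the 3-bit interframe space, and uses the commonly quoted worst-case bit-stuffing bound. The function names are made up for illustration.

```python
def frame_bits(data_bytes, worst_case_stuffing=False):
    """Bits on the wire for one classic CAN frame with an 11-bit identifier."""
    if not 0 <= data_bytes <= 8:
        raise ValueError("classic CAN carries 0 to 8 data bytes")
    # SOF 1 + ID 11 + RTR 1 + IDE 1 + r0 1 + DLC 4 + CRC 15 + CRC delimiter 1
    # + ACK 1 + ACK delimiter 1 + EOF 7 + interframe space 3 = 47 overhead bits.
    bits = 47 + 8 * data_bytes
    if worst_case_stuffing:
        # Stuff bits can only appear in the 34 + 8n bits from SOF through CRC.
        bits += (34 + 8 * data_bytes - 1) // 4
    return bits


def frame_time_us(bitrate, data_bytes, worst_case_stuffing=False):
    """Time one frame occupies the bus, in microseconds."""
    return frame_bits(data_bytes, worst_case_stuffing) * 1_000_000 / bitrate


def max_frames_per_second(bitrate, data_bytes, worst_case_stuffing=False):
    """Upper bound on back-to-back frames per second at 100% bus load."""
    return bitrate // frame_bits(data_bytes, worst_case_stuffing)


bitrate = 500_000  # the bitrate configured with "ip link" earlier in the thread
print(frame_bits(8), "bits,", frame_time_us(bitrate, 8), "us per 8-byte frame")
print(max_frames_per_second(bitrate, 8), "frames/s without stuff bits")
print(max_frames_per_second(bitrate, 8, worst_case_stuffing=True),
      "frames/s with worst-case stuffing")
```

Under these assumptions an 8-byte frame is 111 bits, or about 222 µs at 500 kbit/s, which agrees with the "about 100 bits" and "200 µs" figures above. A fully loaded bus would carry somewhere between roughly 3700 and 4500 such frames per second, so 2000 messages per second is a bit under half of the theoretical maximum.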

JohnS

bart - which CAN transceiver are you using?

Also, how many nodes?

(Maybe I'm not understanding the setup?)

John

bart

The transceiver is the Olimex board https://www.olimex.com/wiki/A20-CAN (MCP2551).

Currently we connect that to a rack that simulates a car with a few controllers on it, so I'm not 100% sure of the number of nodes on the bus, but it's definitely less than 10. The bus works fine otherwise, the existing nodes show no errors, it's only reading that fails.

Thank you very much,
Bart

JohnS

Thanks.

Yes, that ought to work, so you may have found a (software) bug. I would have suggested contacting the driver maintainer (Heinz) directly, but that seems to be in progress already.

If there are any non-default values you've used then I guess Heinz would want to know them.

EDIT: er, I'm not sure it's Heinz.  He does can4linux but I see you're using the mainline driver.

Are you able to at least add some debug printk's etc to the driver?  Or try a changed version if Heinz (or anyone) had one to try?
(Essentially, this means rebuilding the kernel which can sound scary but isn't really.)

John

bart

Yes, I've been compiling the kernel anyway. We are still on the sunxi branch kernel, so the CAN driver is not included; I'm using the one from https://github.com/btolfa/sunxi-can-driver (as suggested by the Olimex documentation). I suspect that driver may be broken?

I'm trying to build the mainline kernel right now, to see if that works.

JohnS

You might try an email to the person shown in the source (Peter Chen).

BTW there's a fair chance the driver gets an interrupt about the problem but is perhaps not handling it properly (or at all?).

John

bart

I'm not very familiar with drivers, but from what I understand the driver does handle the error interrupt:

    if (isrc & DATA_ORUNI) {
        /* data overrun interrupt */
        netdev_dbg(dev, "data overrun interrupt\n");
        cf->can_id |= CAN_ERR_CRTL;
        cf->data[1] = CAN_ERR_CRTL_RX_OVERFLOW;
        stats->rx_over_errors++;
        stats->rx_errors++;
        sun7i_can_write_cmdreg(priv, CLEAR_DOVERRUN);        /* clear bit */
    }


It looks like maybe clearing the flag doesn't always work though?
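
Since the fragment above increments stats->rx_over_errors and stats->rx_errors, one way to see whether this branch keeps firing is to watch the interface counters that the kernel exposes under /sys/class/net/<interface>/statistics. The following is a minimal sketch under that assumption. The helper names read_counters and counter_deltas are made up, and the interface name can0 is taken from the commands earlier in the thread.

```python
from pathlib import Path

# Standard netdev statistics files; the first three correspond to fields
# that a network driver updates in its stats structure.
COUNTERS = ("rx_packets", "rx_errors", "rx_over_errors", "rx_dropped")


def read_counters(interface="can0", sysfs_root="/sys/class/net"):
    """Read the receive counters for one interface; missing files give None."""
    stats_dir = Path(sysfs_root) / interface / "statistics"
    counters = {}
    for name in COUNTERS:
        path = stats_dir / name
        counters[name] = int(path.read_text()) if path.is_file() else None
    return counters


def counter_deltas(before, after):
    """Difference between two snapshots; None where a value is unavailable."""
    deltas = {}
    for name in COUNTERS:
        if before.get(name) is None or after.get(name) is None:
            deltas[name] = None
        else:
            deltas[name] = after[name] - before[name]
    return deltas


# One snapshot; on a machine without can0 every value is simply None.
print(read_counters("can0"))
```

Taking two snapshots a few seconds apart and passing them to counter_deltas shows whether rx_over_errors keeps climbing while rx_packets stands still, which would match the "only overflow errors" state described in the first post. The same counters should also be visible with ip -details -statistics link show can0.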

In any case, I'm away on a holiday for a while, so my colleagues will take it from here.

Again, thank you very much!

JohnS

hmm, your reported error does not appear to match that code fragment.

Have a good hol!

John

KeesZagers

Heinz is informed and will probably react soon.

I'm a bit confused after looking at the sunxi driver. I'm missing the link to CAN4Linux; however, in your first message you referred to the Sourceforge issues about CAN4Linux with the same problem.

In the meantime it would be good to know how many messages per second are coming over the bus. E.g. if this is more than 1000 messages per second, I can imagine that the driver will be overloaded. And if you get an overflow and the bus remains at that high speed, all the next messages will get that overflow also.
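
One way to measure the message rate is to record the bus in candump's log format and count frames per second afterwards. The sketch below assumes a log written with candump -L can0 (lines of the form "(timestamp) interface ID#DATA"). The helper names are made up, and the sample lines are invented purely to show the format.

```python
import re
from collections import Counter

# "(1477485140.123456) can0 5A1#A5": seconds, interface, hex identifier.
LOG_LINE = re.compile(r"^\((\d+)\.\d+\)\s+(\S+)\s+([0-9A-Fa-f]+)#")

CAN_ERR_FLAG = 0x20000000  # error frames are not bus traffic, so skip them


def frames_per_second(lines, interface="can0"):
    """Count data frames per whole second of the log timestamps."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or match.group(2) != interface:
            continue
        if int(match.group(3), 16) & CAN_ERR_FLAG:
            continue
        counts[int(match.group(1))] += 1
    return counts


def peak_rate(counts):
    """Highest number of frames seen within any one-second bucket."""
    return max(counts.values(), default=0)


# Invented sample lines, only to illustrate the expected input format.
sample = [
    "(1477485140.100000) can0 5A1#A5",
    "(1477485140.600000) can0 123#0102030405060708",
    "(1477485141.200000) can0 20000004#0001000000000000",
    "(1477485141.300000) can0 5A1#A5",
]
counts = frames_per_second(sample)
print(dict(counts), "peak:", peak_rate(counts))
```

For a real capture, the lines would come from the log file instead, for example frames_per_second(open("bus.log")) after running candump -L can0 > bus.log for a while (bus.log is just a placeholder name).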

Have a nice holiday.

Kees

Heinz

Hello,

thanks to Kees I'm now following your discussion.

William, who contacted me as the can4linux maintainer, is using the A20. He tried both can4linux and SocketCAN, but both failed on receiving. After a shorter or longer time the receiver stops, because there are no more CAN controller interrupts.
We exchanged a lot of messages and changed the can4linux code, but without success.
This is what he wrote in his last email:

"After spending about many hours on this problem, I've basically all but
confirmed this is a hardware problem."

I will point William to this forum; maybe he will share more information with you.

Heinz

JohnS

You don't think it's that an interrupt came in but was not handled fully, so that after that none comes in again?

John

Heinz

@JohnS
if you were asking me:
No, that is maybe too trivial an error. As far as I know, the last test used heavy traffic and the "hanging" does not happen that often.
If the receive interrupt were not reset correctly, it should happen more often, maybe already at the second message on the bus.