I thought I would update this issue now that I have resolved it, as it may help others to know how I got to the resolution:
So, here is the topology across the 2 x DC sites:
THW:
I-Phone--CPE--LAC--LNS--CORE--Upstream ISP
HEX:
I-Phone--CPE--LAC--LNS--CORE--Upstream ISP
The two core routers at HEX and THW are linked together via X-Connects between the DCs.
HEX did not have the issue, only THW did.
The first thing I had to ask was: "What's the difference between HEX and THW?"
From a configuration perspective there was no difference. The only noticeable difference was that somewhere upstream, the route back to the CPE at THW was coming through HEX first. So, return traffic would hit HEX, then traverse the X-Connects across to THW and on to the CPE.
So, how do we troubleshoot this? I installed iPerf on one of our RADIUS servers (at HEX) and ran the iPerf client on a laptop attached to the CPE at THW. I saw the same issue. I should have been seeing between 7 and 9 MB on iPerf and 0 retries, but what I actually saw was throughput between 0.234 and 0.566 and multiple retries. Okay, great, I can replicate the problem. Now, is it isolated to THW or is it the X-Connects?
The next test, to confirm whether it was the X-Connects or something local to THW, was to install iPerf on the RADIUS server at THW and test from the same THW laptop. I saw the SAME results, which meant the problem was local to THW and nothing to do with the X-Connects.
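The elimination logic in the two tests above can be sketched in a few lines of Python. This is only an illustration: the hop names and the exact path each test takes (for example, where the RADIUS servers attach) are assumptions made for the sketch, not details confirmed in this thread.

```python
def suspect_segments(failing_paths, passing_paths=()):
    """Return segments shared by every failing path and absent from all passing paths."""
    common = set.intersection(*(set(path) for path in failing_paths))
    for path in passing_paths:
        common -= set(path)
    return common

# Assumed hop lists for the two iPerf tests (hypothetical labels).
test_via_hex = ["THW-laptop", "THW-CPE", "THW-LAC", "THW-LNS",
                "THW-CORE", "X-Connect", "HEX-CORE", "HEX-RADIUS"]
test_local = ["THW-laptop", "THW-CPE", "THW-LAC", "THW-LNS",
              "THW-CORE", "THW-RADIUS"]

# Both tests failed, so the fault most likely sits on a segment they share.
suspects = suspect_segments([test_via_hex, test_local])
print(sorted(suspects))
```

The X-Connect does not appear in the result because the local test failed without ever crossing it, which is the same conclusion reached above.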
So, now it was down to interface level. First check the configs... nope, nothing wrong there. Then I ran the following command:
run show interfaces ae1 extensive
I saw the following in the output:
Input errors:
Errors: 15000, Drops: 0, Framing errors: 15000, Runts: 0, Giants: 0, Policed discards: 0, Resource errors: 0
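If you want to check these counters from a script rather than by eye, here is a minimal Python sketch that parses a counter line like the one above and reports the non-zero values. The function name `parse_counters` is made up for this example, and it only assumes the "Name: value, Name: value" layout shown in the output; it is not an official parser for Junos output.

```python
import re

def parse_counters(line):
    """Parse a 'Name: value, Name: value' counter line into a dict of ints."""
    counters = {}
    for name, value in re.findall(r"([A-Za-z ]+?):\s*(\d+)", line):
        counters[name.strip()] = int(value)
    return counters

# The input error line copied from the output above.
sample = ("Errors: 15000, Drops: 0, Framing errors: 15000, Runts: 0, "
          "Giants: 0, Policed discards: 0, Resource errors: 0")

counters = parse_counters(sample)
# Keep only the counters that are not zero.
nonzero = {name: value for name, value in counters.items() if value > 0}
print(nonzero)
```

Running it on two snapshots taken a few seconds apart and comparing the values would show whether a counter is still climbing during a test.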
So, okay. Which member interface of the aggregate is it?
I shut down one member and retested with iPerf. The framing errors increased dramatically, since all traffic was now forced over the remaining member. So I swapped which port was shut down, and the framing errors stopped. iPerf now gave perfect results and no errors were seen again.
Whether this is the SFP, the fiber or a problem with the physical port, we will only know once we can get to the data centres.
Just wanted to show the process I used for troubleshooting.
ADD ON: When I tested the other sites, it must have been pure luck that they looked fine, as one of the interfaces on the aggregated link is healthy. It depends on how Juniper distributes traffic across the members of an aggregated link (maybe round robin, maybe some other form of load balancing). I can look that up.
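On the load-balancing question: as far as I know, aggregated Ethernet links generally balance per flow, using a hash of packet header fields, rather than per-packet round robin. That would explain why some tests landed entirely on the good member. The Python sketch below is only a toy illustration of that idea. The hash (CRC32), the member names, the addresses and the ports are all assumptions for the example and are not Juniper's actual algorithm.

```python
import zlib

def pick_member(flow, members):
    """Map a flow tuple to one member link using a simple hash (illustrative only)."""
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]

# Hypothetical member names for a two-link aggregate.
members = ["member-a", "member-b"]

# Hypothetical flows: (source IP, destination IP, protocol, source port, destination port).
flows = [("192.0.2.10", "198.51.100.5", 6, 40000 + i, 5201) for i in range(8)]

for flow in flows:
    # Every packet of a given flow hashes to the same member link.
    print(flow, "->", pick_member(flow, members))
```

With this kind of scheme, a flow that hashes to the healthy member never touches the faulty one, so a single test can look clean even though the aggregate has a bad link.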