We have been fighting this issue for several months now and have narrowed it down somewhat.
We have an Active Directory setup with 3 sites (HQ, BRANCH, COLO). The HQ and BRANCH sites use SonicWall firewalls running 2.9.1 firmware. The COLO site uses a Juniper 350M running 6.3.0r18.0 firmware. All the sites are cross-connected using site-to-site VPN connections such that all LAN addresses at each site are accessible from all other sites.
The initial symptom we discovered is that for long stretches of time the Active Directory servers in HQ and BRANCH would cease to be able to replicate to COLO - but they would continue to replicate with each other. Typically around 6:30-7:30am each day, whatever was causing the replication to fail would release, all the servers would resync, and replication would work perfectly well for a few hours. Then between 10:30am and 11:30am the replication would start to error out again and would often remain that way until the next morning.
In the process we tried promoting new hardware, changing the MTU on the AD servers, and upgrading all the firewall firmware to the latest releases. None of these things has fixed the issue - although upgrading the firmware does seem to have made the outage less predictable (instead of reliably working 3-5 hours each morning, it now sometimes also works later in the day for a short while).
This appears to be related to the firewall blocking traffic - possibly just the DNS portion of the AD replication, which then causes larger issues. Here is what we see:
* When replication is working - DNS traffic is working perfectly well. From any AD admin console I can connect to any AD server's DNS at any site location. Additionally, all the repadmin command line tools can freely connect between domain controllers.
* When replication stops working - the repadmin command line tools are able to talk freely between Domain Controllers at the same site - and also freely between HQ and BRANCH. But HQ <-> COLO and BRANCH <-> COLO are locked out. Additionally, the DNS management tool cannot connect across that divide either.
* BUT all other traffic seems to be functional over the site to site VPN. TCP stream socket things like Remote Desktop, FTP, Telnet, etc. work without issue. Windows file shares and DFS replication continue to work fine. PING traffic also flows freely. So the site to site VPN is functional.
* If I take my machine in the HQ location and VPN directly into a machine in the COLO location using PPTP it is able to connect to the remote DNS servers without issues.
So there doesn't appear to be any specific issue with the AD servers - the issue appears to be that somehow the site to site VPN is blocking some subset of traffic, causing the AD servers to fail to replicate. At a minimum, DNS is important to AD replication and DNS traffic is clearly being blocked. But as was stated - all other traffic appears to be flowing perfectly fine during the "outage". And then it will magically correct itself (typically early in the morning) without any user intervention.
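To pin down exactly which traffic is being dropped during an outage, it may help to probe the individual protocols from a machine in HQ or BRANCH. DNS runs over both UDP and TCP port 53, and AD replication relies on RPC, which starts at the endpoint mapper on TCP 135. The following is only a rough sketch in Python, not something we have run in this environment; the server address `10.0.3.10` and the name `dc1.example.local` are made-up placeholders for a COLO domain controller, and the function names are just for illustration.

```python
import socket
import struct


def build_dns_query(name: str, qtype: int = 1, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe_dns_udp(server: str, name: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers a DNS query over UDP 53."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(query, (server, 53))
            reply, _ = s.recvfrom(4096)
        except OSError:  # includes timeouts
            return False
    return reply[:2] == query[:2]  # transaction ID must match


def probe_dns_tcp(server: str, name: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers a DNS query over TCP 53."""
    query = build_dns_query(name)
    try:
        with socket.create_connection((server, 53), timeout=timeout) as s:
            # DNS over TCP prefixes each message with a 2-byte length.
            s.sendall(struct.pack("!H", len(query)) + query)
            reply = s.recv(4096)
    except OSError:
        return False
    return len(reply) >= 4 and reply[2:4] == query[:2]


def probe_tcp_port(server: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to server:port succeeds."""
    try:
        with socket.create_connection((server, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    server = "10.0.3.10"        # placeholder: a COLO domain controller
    name = "dc1.example.local"  # placeholder: a name that server should resolve
    print("DNS over UDP 53:", probe_dns_udp(server, name))
    print("DNS over TCP 53:", probe_dns_tcp(server, name))
    print("RPC endpoint mapper TCP 135:", probe_tcp_port(server, 135))
```

Running this once while replication works and again during an outage should show whether only UDP is affected, only DNS, or RPC as well, which would narrow down what to look for in the firewall logs.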
Obviously I am using 2 different firewalls here, so you could point the finger at either side. But the compelling factor that makes me believe it is on the Juniper side is that once replication and/or DNS traffic is blocked from COLO to HQ - it also fails from COLO to BRANCH. If this were as simple as a state problem within the HQ SonicWall firewall, it shouldn't prevent the COLO from talking to the BRANCH over their own direct site to site VPN. And the HQ to BRANCH replication continues to work without issue.
What is also odd about this is that rebooting the HQ firewall seems to release the lockup, whatever it is, for a while. This would point to a potential issue with the SonicWall, but once again - I only have to reboot that one firewall to get traffic flowing from the COLO to the BRANCH site as well, and when the outage happens it happens in unison between the sites.
This leads me to think there is something in the Juniper that is actually blocking the traffic - although the site to site policy is configured to permit any.
Has anyone encountered anything even remotely similar to this, or have any guidance? As I have said, this has been an ongoing headache for months. We believe that we have narrowed it down to the firewall, but we don't see anything obvious that would cause it to work some of the time and then fail the rest of the time.