Now, before I write the solution, I must say that I did what I thought was due diligence:
I installed 22.3 on a dozen EX2300s located in a separate rack, always powered on and ready to be configured for the next client installation. Everything worked fine.
Before upgrading the 250 switches in production, I also upgraded 5 EX2300s at some nearby sites that I controlled, just in case something went wrong. Nothing went wrong.
As far as I can tell, what happened is this...
At step 4, I mentioned that I had used my lab setup to prepare a configuration using an xe interface. I copied that configuration to the 2 production switches, but without paying attention to the fact that those 2 prod switches now had an me0 management port configured.
A year (and some image updates) later, I finally pushed the new 22.3 image to the prod switches. My 2nd mistake was not paying attention to the fact that the software add command had the no-validate option.
Last chance: do you see what's coming? When I remotely asked for a reboot of all the 2300s, I lost access to the 2 switches that had an me0 mgmt interface with an IP address, since 22.3 refuses a commit when both the vme AND me0 interfaces have an address configured. vme was still there from the factory configuration.
error: Address cannot be configured on me0 and vme at the same time
error: configuration check-out failed
I was therefore rebooting a switch with a non-working configuration. And since I had carefully updated the rescue configuration to match the running one before the reboot, even the rescue configuration could not be installed on the switch.
Had I not used the no-validate option when doing the image update, the installation would have aborted with a clear error message:
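For anyone who wants to look for this conflict before rebooting, here is a minimal sketch in Python that scans a configuration in set format (the text you get from "show configuration | display set") and reports whether both me0 and vme carry an address. The function names are made up for illustration. It also assumes that a "dhcp" statement on vme counts as an address for the purpose of this check, which I have not verified against 22.3; treat it as a rough pre-check, not a replacement for the validation done by the installer.

```python
from __future__ import annotations

import re

# Matches set-format lines that give an interface an IPv4/IPv6 address,
# either statically ("address ...") or through "dhcp".
# ASSUMPTION: dhcp on vme is treated like an address by the 22.3 check.
ADDRESS_LINE = re.compile(
    r"^set interfaces (?P<ifname>\S+) unit \d+ family inet6? (?:address|dhcp)\b"
)


def interfaces_with_address(set_config: str) -> set[str]:
    """Return the names of all interfaces that have an address configured."""
    found = set()
    for line in set_config.splitlines():
        match = ADDRESS_LINE.match(line.strip())
        if match:
            found.add(match.group("ifname"))
    return found


def has_me0_vme_conflict(set_config: str) -> bool:
    """True when both me0 and vme have an address in the given configuration."""
    return {"me0", "vme"} <= interfaces_with_address(set_config)


if __name__ == "__main__":
    # Hypothetical configuration resembling the two affected switches.
    sample = """
    set interfaces me0 unit 0 family inet address 192.0.2.10/24
    set interfaces vme unit 0 family inet dhcp
    set interfaces xe-0/1/0 unit 0 family ethernet-switching
    """
    print(has_me0_vme_conflict(sample))
```

Run against the saved configuration of every switch, a check like this would single out the ones that differ from the rest of the fleet before the reboot rather than after it.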
Interface control process: <message>Address cannot be configured on me0 and vme at the same time</message>
Interface control process: </xnm:error>
mgd: error: configuration check-out failed
ERROR: Current configuration not compatible with junos-arm-32-22.3R1.11.tgz
My lesson learned is not to use the no-validate flag when doing an image upgrade, and also to upgrade by small groups: you never know when a surprising sequence of events will destroy your best-laid plans.
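The "small groups" idea can be sketched as follows. This is only an illustration under my own assumptions: the host names, the function name and the batch size are hypothetical, and the actual reboot and reachability check for each group are left out. The point is simply to put the switches whose configuration differs from the rest (here, the ones with an xe interface or an me0 address) in the first group, and to stop if that group does not come back.

```python
from __future__ import annotations

from typing import Iterator, Sequence


def upgrade_batches(
    hosts: Sequence[str], canaries: Sequence[str], batch_size: int
) -> Iterator[list[str]]:
    """Yield the canary switches first, then the others in fixed-size groups."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    canary_set = set(canaries)
    # First group: the switches most likely to behave differently.
    first = [h for h in hosts if h in canary_set]
    if first:
        yield first
    # Remaining switches, in their original order, batch_size at a time.
    rest = [h for h in hosts if h not in canary_set]
    for start in range(0, len(rest), batch_size):
        yield rest[start:start + batch_size]


if __name__ == "__main__":
    # Hypothetical inventory of 7 switches, 2 of which have an odd config.
    switches = [f"sw{i}" for i in range(1, 8)]
    for group in upgrade_batches(switches, canaries=["sw3", "sw6"], batch_size=2):
        # Reboot this group, confirm every switch is reachable, then continue.
        print(group)
```

With 250 switches the groups take longer than one big reboot, but a bad surprise only costs you the first group.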
Sent: 02-03-2023 08:26
From: Michel Lapointe
Subject: How about a catastrophe scenario ...
Let's see when YOU can spot the problem before it happens
1) I use mgmt port in my lab on my 10 2300-C switches for convenience
2) I don't use mgmt. port on the 250 switches in the field. I have a management vlan shared by all switches
3) Lab and field switches run on Junos 20.4R2S2
4) Last year, we had to add an sfp with xe interface on 2 field switches.
5) I tested a config in my lab and, when I was sure it worked , copied it from the lab remotely on the 2 sites.
6) Last week we wanted to upgrade the 250 prod switches to Junos 22.3
7) I read the release notes for 22.3 and didn't notice anything that applied to my configurations
8) I did the same steps that I have been doing for 3 years, hundreds of times
a. Have coffee
b. Put the new image in /tmp directory on each switch
c. Md5 check that the file is OK
d. Request software add <new image> no-copy no-validate
e. Make sure the image is correctly installed with the last line " will take effect at next reboot"
f. Notify clients of the coming 15-minute restart of the 2300s at the scheduled date
g. Have coffee
9) On the scheduled date, at 23:00, I ask for a reboot on all switches
10) After 15 minutes, all switches were back online, running on Junos 22.3 ... all except 2, with which I had totally lost contact.
Took me a while to figure it out. I learned from my mistake and I'll share the solution with you next week :)