I appreciate the detailed post.
Actually, I was trying to figure out whether I could even run the Ansible playbook against the NETCONF IP itself.
Because of the gateway OS on site, I couldn't run the upgrade script; its PHP is outdated and I cannot upgrade it. From that host I was able to run a playbook that performed a basic NETCONF check, which passed, so I assume this is actually doable. Now I'm trying to work out why I can't run playbooks against my SRX over the jumphost.
I am not too worried about downtime. The way I usually run it: reboot the secondary and then the primary to make sure there are no hardware issues, then upgrade the secondary, then the primary, then reboot the cluster. I also have a lab I can test in beforehand.
As for transferring files between the SRX nodes, it looks to me like the SCP module was deprecated for copying to or from an SRX, and the current recommendation is ansible.netcommon.net_get/net_put, so that may be my way in. I had issues with that too, so I'll need to do some more research on it.
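For reference, here is a minimal sketch of what a net_put-based transfer might look like. This is untested against your setup; the host name and file paths are placeholders, and it assumes a network_cli connection to the device:

```yaml
# Sketch: pushing the install package to the SRX with ansible.netcommon.net_put,
# since the older SCP approach is deprecated. Host names and paths are placeholders.
- name: Copy Junos image to the primary node
  hosts: srx_primary
  gather_facts: false
  connection: ansible.netcommon.network_cli
  vars:
    ansible_network_os: junipernetworks.junos.junos
  tasks:
    - name: Upload install package to /var/tmp
      ansible.netcommon.net_put:
        src: "images/junos-install-vsrx3-x86-64-21.2R1.10.tgz"
        dest: "/var/tmp/junos-install-vsrx3-x86-64-21.2R1.10.tgz"
        protocol: scp
```

Note that net_put wants a network_cli connection, so it would have to run in a separate play from any netconf-based upgrade tasks.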
So my current dilemma is the jumphost and NETCONF; once that works I can most likely run the regular upgrade module. I just need to see whether I can change the upgrade portion so that it is a manual process performed on the cluster.
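On the jumphost side, one thing that may be worth trying (an assumption on my part, not something I have verified against an SRX) is pointing the netconf connection plugin at an ssh_config file that defines the hop, since the ncclient-based netconf plugin can be told to honour an ssh_config:

```yaml
# group_vars/srx.yml -- sketch only; all values are placeholders.
# The netconf connection plugin can load an ssh_config file, which lets you
# define a ProxyJump/ProxyCommand for the SRX management addresses, e.g.:
#
#   # ~/.ssh/config
#   Host 192.0.2.*                        # example SRX fxp0 range
#       ProxyJump admin@jumphost.example.com
#
ansible_connection: ansible.netcommon.netconf
ansible_network_os: junipernetworks.junos.junos
ansible_netconf_ssh_config: ~/.ssh/config
```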
Original Message:
Sent: 06-15-2024 13:32
From: asharp
Subject: SRX Upgrade HA Chassis
Since I was using a combination of Ansible, PyEz scripts and Juniper Ansible modules, the requirement for that was that the SRX have NETCONF enabled. That as far as I can recall was the only pre-requisite that we had for the SRX.
Well, the Ansible playbook was just launched from a suitable workstation. For my testing/development I was using a Mac; the customer was using whatever host they wanted, I guess a Linux host of some description. For development purposes I was using a vSRX3.0 cluster running on a Windows 10 desktop, and the customer ran it against a mixture of physical SRX and vSRX deployed on some setup that they had. I can't recall all the details now, as this was a project I worked on two years ago.
All comms for that particular project were just to the master-only address assigned to fxp0, so we only had connectivity to the primary node, whichever that happened to be. We tested and ran the playbooks against a number of clusters, both physical and virtual, and from the playbook's perspective it didn't matter which node it was talking to, since that node was always the primary.
Now, I must say again, that this particular project was always going to involve at some point both nodes being rebooted close together, we couldn't isolate the nodes because we did not have connectivity to the secondary node, only to the primary node. So we were unable to follow the approach of a "minimal down-time", since we could not break the connectivity between the nodes as that would then mean that we had no way to reach the secondary node.
So what upgrade approach are you looking to perform? Are you trying to perform a minimal down-time upgrade? Or do you not care about failover and the like and just want to upgrade both nodes and reboot the entire cluster afterwards which will mean that the cluster will be offline for a few minutes?
To transfer the s/w image to the primary node, I just used the standard Juniper ansible module as far as I can recall, something like the following:
# software add on the primary node
- name: Software upgrade primary node
  juniper_junos_software:
    provider: "{{ credentials }}"
    local_package: "{{ pkg_dir }}/{{ OS_package }}"
    remote_package: "{{ remote_package }}"
    no_copy: "{{ no_copy_image }}"
    reboot: "{{ reboot }}"
    validate: "{{ validate }}"
    checksum_algorithm: sha1
  register: upgrade_response

# assert that the software was installed successfully
- name: Primary node install check
  assert:
    that:
      - "upgrade_response is match('.*successfully installed.*')"
    fail_msg: "Package failed to install!"
    success_msg: "Package installed successfully, awaiting reboot."
  when: not ansible_check_mode

Which leveraged the following variables:

pkg_dir: "images"
OS_version: "21.2R1.10"
OS_package: "junos-install-vsrx3-x86-64-21.2R1.10.tgz"
remote_package: "/var/tmp/"
reboot: false
validate: false
no_copy_image: false
After the s/w was installed successfully on the primary node, then it was just necessary to copy the file from the primary node to the secondary node.
That was performed with a custom Ansible module written in Python, which used the rcp -T command to copy the file from one node to the other; the playbook used something like the following:
- name: Copy image to secondary node
  vsrx_cluster_copy_image:
    host: "{{ inventory_hostname }}"
    user: "{{ username }}"
    passwd: "{{ password }}"
    node: "{{ other_node }}"
    source: "{{ remote_package }}{{ OS_package }}"
    dest: "{{ remote_package }}"
  register: copy_response
We did see a few differences in behaviour between physical SRX and vSRX; some of the returned messages were in a different format as far as I can recall, so we had to put some logic into the playbook to identify what kind of device we were dealing with. That was handled with this kind of approach:
- name: Gather facts
  juniper_junos_facts:
    provider: "{{ credentials }}"
    level: INFO
  register: junos

- name: Identify chassis type
  set_fact:
    chassis_type: "{% if junos.ansible_facts.junos.model == 'VSRX' %}VSRX{% else %}SRX{% endif %}"

# query facts to establish which re_name the connection has been made to
- name: Identify node
  set_fact:
    node_name: "{{ junos.ansible_facts.junos.re_name }}"

# a boolean true|false if this is the primary node in the cluster
- name: Register primary node
  set_fact:
    primary_node: "{% if \
      (junos.ansible_facts.junos.srx_cluster_redundancy_group['0'].node0.status == 'primary') \
      and (node_name == 'node0') %}True{% elif \
      (junos.ansible_facts.junos.srx_cluster_redundancy_group['0'].node1.status == 'primary') \
      and (node_name == 'node1') %}True{% else %}False{% endif %}"

# name of this node
- name: Identify this node
  set_fact:
    this_node: "{{ junos.ansible_facts.junos.current_re[0] }}"

# name of the other node
- name: Identify other node
  set_fact:
    other_node: "{% if this_node == 'node0' %}node1{% else %}node0{% endif %}"

# assert that this is the primary node
- name: Verify this is the primary node
  assert:
    that:
      - primary_node | bool
    fail_msg: "Fail: This is the secondary node!"
I remind you again that you cannot just force through an upgrade of an SRX cluster. Each node must NOT see the other cluster member running on a different version of code. If that happens, usually the cluster just doesn't work, then you have to manually jump in and disconnect the nodes from each other, reboot them again and upgrade them individually before you can finally reboot them again and let them form a cluster once more.
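To make that failure mode harder to hit, a guard task along these lines could be run before any reboot step. This is only a sketch: node0_version and node1_version are hypothetical variables you would populate from per-node facts gathered beforehand.

```yaml
# Sketch: refuse to reboot the cluster if the nodes would come up on different
# Junos versions. node0_version / node1_version are placeholder facts that you
# would need to gather per node beforehand.
- name: Verify both nodes carry the same target version
  assert:
    that:
      - node0_version == node1_version
    fail_msg: "Version mismatch: node0={{ node0_version }}, node1={{ node1_version }}"
    success_msg: "Both nodes on {{ node0_version }}, safe to reboot."
```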
I don't think that I can share the whole playbooks here, as I mentioned this was something developed as part of customer project.
Ideally, it would be better to understand what upgrade process you are trying to perform, on what type of SRX since the different models can have different approaches, and also the s/w versions that are involved in the upgrade etc. Then we can try to tailor the solution and approach to fit your needs, rather than just a particular project that I worked on that imho wasn't the right way to go about it, but we had no choice.
------------------------------
Andy Sharp
Original Message:
Sent: 06-14-2024 11:43
From: abdulellahib
Subject: SRX Upgrade HA Chassis
Did you just run the Junos upgrade script via fxp0? I get errors attempting that, which made me think either the issue was my bastion host or that those out-of-band ports don't support the connection methods being used.
I tried building a whole script that would download the files and then copy them over via SCP from one node to another, but it also errors out with messages like "only allowed via CLI" and the like.
My last resort was going to be forcing routing-engine failovers: upgrade node1 and reboot it, which automatically makes node0 primary again, then push the upgrade to that one. But that's rather finicky and doubles the downtime.
Can you share the playbooks you used? Was file transfer not an issue for you, or did you, instead of downloading from the SRX, SCP to your SRX from a remote host?
Original Message:
Sent: 06-14-2024 06:09
From: asharp
Subject: SRX Upgrade HA Chassis
Yes, I have used Ansible to upgrade SRX clusters in the past.
It's not straight-forward since you can't typically have both nodes running at the same time with a different version of code.
For example the following describes the manual process that you would need to follow when creating your role/playbook etc.
https://supportportal.juniper.net/s/article/SRX-How-to-upgrade-Junos-OS-on-a-Chassis-Cluster?language=en_US
https://supportportal.juniper.net/s/article/SRX-How-to-upgrade-an-SRX-cluster-with-minimal-down-time?language=en_US#:~:text=This%20is%20achieved%20by%20isolating,cluster%20of%20both%20upgraded%20nodes.
As you can see quite a lot to automate via Ansible, and the above approach does have a requirement that both nodes are reachable by their own IP and not just via the master-only IP.
A few years ago I did a project with a customer to automate the upgrade of their SRX, and for whatever reason (can't recall now), they did not have access to each node via fxp0; the upgrade had to be performed via the master-only address. That was a pain, since we had to perform all the tasks via the primary node, leveraging rlogin and a bunch of custom tricks to upgrade the backup (without a reboot), upgrade the primary (without a reboot), and then reboot the backup node and very shortly afterwards reboot the primary, making sure this was triggered before the backup had come online again. This did involve a short outage, since we couldn't isolate the backup node without losing access to it.
Regards,
------------------------------
Andy Sharp
Original Message:
Sent: 06-13-2024 02:40
From: Anonymous
Subject: SRX Upgrade HA Chassis
This message was posted by a user wishing to remain anonymous
Has anyone ever upgraded an SRX using ansible or another method for HA Clusters?
I attempted to upgrade my SRX lab chassis, but it seems the module only upgrades one of the nodes (whichever is primary).
It seems that natively I cannot deploy the upgrade to multiple nodes in a chassis, and the fxp0 IPs of node0 and node1 cannot be used for NETCONF configs or Ansible in general?
My theory was to run a playbook with a manual failover of the routing engines, upgrade node1 with a reboot, and then, once node0 takes primary, run the upgrade on that and let it reboot. But that seems impractical and time-consuming, and I am surprised I was not able to find anything about it online.
I also attempted to run it via playbooks passing CLI commands, but some commands, like SCP from node0 to node1, were not allowed.
Has anyone figured out upgrading SRX HA Chassis of 2 nodes with ansible or any other automation method?