Automation

Ask questions and share experiences about Apstra, Paragon, and all things network automation.
  • 1.  SRX Upgrade HA Chassis

    This message was posted by a user wishing to remain anonymous
    Posted 06-13-2024 06:57

    Has anyone ever upgraded SRX HA clusters using Ansible or another method?

    I attempted to upgrade my SRX lab chassis cluster, but it seems the module only upgrades one of the nodes (whichever is primary).

    It seems that natively I cannot deploy the upgrade to multiple nodes in a chassis cluster. Also, can the fxp0 IPs of node0 and node1 not be used for NETCONF configs or Ansible in general?

    My theory was to run a playbook that does a manual failover of the routing engines, upgrades node1 and reboots it, and then, when node0 takes primary, runs the upgrade on that node and lets it reboot. That seems impractical and time-consuming, though. I am surprised I was not able to find anything about it online.

    I also attempted to run it via playbooks passing CLI commands, but some commands, like scp from node0 to node1, were not allowed.


    Has anyone figured out how to upgrade a two-node SRX HA chassis cluster with Ansible or any other automation method?



  • 2.  RE: SRX Upgrade HA Chassis

     
    Posted 06-14-2024 06:10
    Edited by asharp 06-15-2024 10:11

    Yes, I have used Ansible to upgrade SRX clusters in the past.

    It's not straightforward, since you typically can't have both nodes running at the same time with different versions of code.

    For example, the following articles describe the manual process that you would need to follow when creating your role/playbook etc.

    https://supportportal.juniper.net/s/article/SRX-How-to-upgrade-Junos-OS-on-a-Chassis-Cluster?language=en_US

    https://supportportal.juniper.net/s/article/SRX-How-to-upgrade-an-SRX-cluster-with-minimal-down-time?language=en_US#:~:text=This%20is%20achieved%20by%20isolating,cluster%20of%20both%20upgraded%20nodes.

    As you can see, that is quite a lot to automate via Ansible, and the above approach requires that both nodes are reachable by their own IP and not just via the master-only IP.

    A few years ago I did a project with a customer to automate the upgrade of their SRX clusters. For reasons I can't recall now, they did not have access to each node via fxp0, so the upgrade had to be performed via the master-only address. That was a pain, since we had to perform all the tasks via the primary node, leveraging rlogin and a bunch of custom tricks to reach the backup. The sequence was: upgrade the backup (without a reboot), upgrade the primary (without a reboot), then reboot the backup node and very shortly afterwards reboot the primary, making sure that this was triggered before the backup had come online again. This did involve a short outage, since we couldn't isolate the backup node without losing access to it.
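
The ordering constraint in that master-only workflow can be written down as a small sketch. This is only an illustration of the sequence described above, not code from the original project; the function name and the action labels are hypothetical and are not Junos commands.

```python
from typing import List, Tuple


def master_only_upgrade_plan(primary: str) -> List[Tuple[str, str]]:
    """Return (action, node) steps in the order described in the post.

    The action names are illustrative labels, not Junos commands.
    """
    if primary not in ("node0", "node1"):
        raise ValueError("primary must be 'node0' or 'node1'")
    backup = "node1" if primary == "node0" else "node0"
    return [
        ("install_without_reboot", backup),
        ("install_without_reboot", primary),
        ("reboot", backup),
        # Trigger this before the backup is back online, so the two nodes
        # never run as a cluster on different versions.
        ("reboot", primary),
    ]


for action, node in master_only_upgrade_plan("node0"):
    print(action, node)
```

Keeping the order in one place makes it harder for a playbook to reboot the primary before both installs have finished.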

    The latter approach required some PyEZ scripts to be created for a variety of specific purposes during the workflow. For example:

    Clean up storage on the secondary node.  This involved using StartShell() to open a shell on the primary node.

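    from jnpr.junos.utils.start_shell import StartShell
    ss = StartShell(dev)  # dev: an open PyEZ Device connected to the primary node
    ss.open()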

    Then run rlogin -T node... to log in to the required node (node0 or node1, depending on which was the secondary).

    rlogincommand = "rlogin -T %s" % module.params["node"]
    shellcmd1 = ss.run(rlogincommand, timeout=30)[1]

    Then finally, get to the shell of the secondary node and actually trigger the cleanup command, e.g.

    cleancommand = "cli -c 'request system storage cleanup no-confirm | display json | no-more; exit'"
    shellcmd2 = ss.run("start shell", timeout=30)[1]
    shellcmd3 = ss.run(cleancommand, timeout=30)[1]

    Not the most elegant of approaches, but something that worked for that particular project.
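
Put together, the three snippets above amount to roughly the following helper. It is a sketch under assumptions, not the original module: `cleanup_storage_on_node` and `RecordingShell` are hypothetical names, and `ss` is assumed to behave like an opened PyEZ `StartShell`, whose `run()` returns a `(success, output)` tuple. The recording class only exists so the sequence can be dry-run without a device.

```python
class RecordingShell:
    """Dry-run stand-in that records commands instead of sending them."""

    def __init__(self):
        self.sent = []

    def run(self, command, timeout=0):
        self.sent.append(command)
        return (True, "")


CLEAN_COMMAND = (
    "cli -c 'request system storage cleanup no-confirm "
    "| display json | no-more; exit'"
)


def cleanup_storage_on_node(ss, node, timeout=30):
    """Hop from the primary's shell to `node` and run the storage cleanup.

    Returns the raw output of each command; like the snippets above, it
    does not inspect the success flag returned by run().
    """
    outputs = []
    for command in ("rlogin -T %s" % node, "start shell", CLEAN_COMMAND):
        outputs.append(ss.run(command, timeout=timeout)[1])
    return outputs


shell = RecordingShell()
cleanup_storage_on_node(shell, "node1")
print("\n".join(shell.sent))
```

Against a real cluster, `shell` would be the opened `StartShell` session on the primary node instead of the recording stand-in.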

    A similar approach was used to copy software images from the primary to the secondary node and to trigger the s/w install on the secondary node.

    File copy...
            filecommand = "rcp -T %s %s:%s/." % (
                module.params["source"],
                module.params["node"],
                module.params["dest"],
            )
    
    sw install...
            if module.params["no_copy"] is True:
                no_copy = " no-copy"
            else:
                no_copy = ""
    
            if module.params["validate"] is False:
                validate = " no-validate"
            else:
                validate = " validate"
            rlogincommand = "rlogin -T %s" % module.params["node"]
            addswcommand = "request system software add%s%s %s" % (
                no_copy,
                validate,
                module.params["source"],
            )
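
The flag handling in the install fragment can be isolated into a plain function, which makes it easy to check the resulting CLI string before sending it to a device. This is a sketch only; `build_software_add_command` is a hypothetical name, and the defaults are taken from the variables shown elsewhere in this thread.

```python
def build_software_add_command(source, no_copy=False, validate=False):
    """Build the 'request system software add' CLI string.

    Mirrors the fragment above: ' no-copy' is added only when requested,
    and ' no-validate' is used only when validate is explicitly False.
    """
    no_copy_flag = " no-copy" if no_copy is True else ""
    validate_flag = " no-validate" if validate is False else " validate"
    return "request system software add%s%s %s" % (
        no_copy_flag,
        validate_flag,
        source,
    )


print(build_software_add_command(
    "/var/tmp/junos-install-vsrx3-x86-64-21.2R1.10.tgz"
))
```

The returned string would then be run on the secondary node after the `rlogin -T` hop, in the same way as the cleanup command.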

    The trickiest part of the process when doing the entire s/w install/upgrade using just the master-only address was to reboot the secondary node. 

    Following a similar approach (connect to the primary node, open a shell, execute rlogin -T to log in to the secondary node, then start the shell once more to get to the shell of the secondary node), I then used the following command to trigger a NETCONF call via the shell to reboot the node.

    rebootcommand = '( echo "<rpc><request-reboot><in>1</in></request-reboot></rpc>" && cat ) | xml-mode netconf need-trailer'
    

    That was the first time that I had ever used xml-mode in this way. I knew that it already existed in Junos and that PyEZ leverages it to trigger NETCONF commands via a console session, for example.
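
If the reboot delay needs to vary, the command string can be generated rather than hard-coded. A minimal sketch, assuming that the `<in>` element carries the delay in minutes (as with `request system reboot in`); `build_reboot_command` is a hypothetical helper name.

```python
def build_reboot_command(delay_minutes=1):
    """Build the shell pipeline that feeds a request-reboot RPC to xml-mode.

    Assumption: <in> is the reboot delay in minutes.
    """
    if delay_minutes < 0:
        raise ValueError("delay_minutes must not be negative")
    rpc = "<rpc><request-reboot><in>%d</in></request-reboot></rpc>" % delay_minutes
    return '( echo "%s" && cat ) | xml-mode netconf need-trailer' % rpc


print(build_reboot_command(1))
```

With the default delay of 1, the output is the same string as the `rebootcommand` shown above.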

    Don't hesitate to ask if you need more assistance on this, but it is by far an easier process to automate when you have access to fxp0 on each of the nodes.

    Regards,



    ------------------------------
    Andy Sharp
    ------------------------------



  • 3.  RE: SRX Upgrade HA Chassis

    Posted 06-15-2024 12:21

    Did you just run the Junos upgrade script via fxp0? I get errors attempting that, which made me think either the issue was my bastion host or those out-of-band ports don't support the connection methods being used.

    I tried configuring a whole script that would download the files and then copy them via scp from one node to the other, but it also errors out with "only allowed via cli" and the like.

    My last attempt was going to be just pushing routing engine failovers: upgrading node1 and rebooting it, which would automatically make node0 primary again, and then pushing the upgrade to that one. But it's rather finicky and means double downtime.

    Can you share what playbooks you used? Was file transfer not an issue for you? Instead of downloading from the SRX, did you scp to your SRX from a remote host?




  • 4.  RE: SRX Upgrade HA Chassis

     
    Posted 06-15-2024 14:03

    Since I was using a combination of Ansible, PyEZ scripts and the Juniper Ansible modules, the requirement was that the SRX have NETCONF enabled.  As far as I can recall, that was the only prerequisite that we had for the SRX.

    The Ansible playbook was just launched from a suitable workstation. For my testing/development I was using a Mac; the customer was using whatever host they wanted to use, I guess a Linux host of some description.  For development purposes I was using vSRX 3.0 in a cluster running on a Windows 10 desktop, and the customer was running it against a mixture of physical SRX and vSRX deployed on some setup that they had. I can't recall all the details, as this was a project that I worked on two years ago.

    All comms for that particular project were just to the master-only address assigned to fxp0, so it only had connectivity to the primary node, whichever that happened to be.  We tested and ran the playbooks against a number of clusters, both physical and virtual, and the playbook didn't care which node it was talking to, since whichever node it was, it was going to be the primary node.

    Now, I must say again that this particular project was always going to involve both nodes being rebooted close together at some point. We couldn't isolate the nodes because we did not have connectivity to the secondary node, only to the primary node.  So we were unable to follow the "minimal down-time" approach, since breaking the connectivity between the nodes would have left us with no way to reach the secondary node.

    So what upgrade approach are you looking to perform?  Are you trying to perform a minimal down-time upgrade? Or do you not care about failover and the like and just want to upgrade both nodes and reboot the entire cluster afterwards which will mean that the cluster will be offline for a few minutes?

    To transfer the s/w image to the primary node, I just used the standard Juniper ansible module as far as I can recall, something like the following:

    # software add on the primary node
    - name: Software upgrade primary node
      juniper_junos_software:
        provider: "{{ credentials }}"
        local_package: "{{ pkg_dir }}/{{ OS_package }}"
        remote_package: "{{ remote_package }}"
        no_copy: "{{ no_copy_image }}"
        reboot: "{{ reboot }}"
        validate: "{{ validate }}"
        checksum_algorithm: sha1
      register: upgrade_response
    
    # assert that the software was installed successfully
    - name: Primary node install check
      assert:
        that:
          - "upgrade_response is match('.*successfully installed.*')"
        fail_msg: "Package failed to install!"
        success_msg: "Package installed successfully, awaiting reboot."
      when: not ansible_check_mode
    
    This leveraged the following variables:
    pkg_dir: "images"
    OS_version: "21.2R1.10"
    OS_package: "junos-install-vsrx3-x86-64-21.2R1.10.tgz"
    remote_package: "/var/tmp/"
    reboot: false
    validate: false
    no_copy_image: false

    After the s/w was installed successfully on the primary node, it was just necessary to copy the file from the primary node to the secondary node.

    This was performed with a custom Ansible module written in Python that used the rcp -T command to copy the file from one node to the other, and the playbook just used something like the following:

    - name: Copy image to secondary node
      vsrx_cluster_copy_image:
        host: "{{ inventory_hostname }}"
        user: "{{ username }}"
        passwd: "{{ password }}"
        node: "{{ other_node }}"
        source: "{{ remote_package }}{{ OS_package }}"
        dest: "{{ remote_package }}"
      register: copy_response
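
To show how the task parameters above map onto the `rcp -T` command, here is a minimal sketch of the string the custom module would need to build. `build_rcp_command` is a hypothetical name, not the actual module code, and stripping the trailing slash from `dest` is a small addition so that `/var/tmp/` does not turn into `/var/tmp//.`.

```python
def build_rcp_command(source, node, dest):
    """Build the rcp command that copies `source` from this node to `node`.

    `source` is the full path of the image on the primary, `node` is
    'node0' or 'node1', and `dest` is the target directory.
    """
    return "rcp -T %s %s:%s/." % (source, node, dest.rstrip("/"))


# Values taken from the variables shown earlier in this post.
remote_package = "/var/tmp/"
os_package = "junos-install-vsrx3-x86-64-21.2R1.10.tgz"
print(build_rcp_command(remote_package + os_package, "node1", remote_package))
```

The command is meant to be run from a shell on the primary node, since that is the only node reachable through the master-only address.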

    We did have a few differences in behaviour between physical SRX and vSRX. Some of the messages returned were in a different format, as far as I can recall, so we had to put some logic into the playbook to identify what kind of device we were dealing with. That was handled by this kind of approach:

    - name: Gather facts
      juniper_junos_facts:
        provider: "{{ credentials }}"
        level: INFO
      register: junos
    
    - name: Identify chassis type
      set_fact:
        chassis_type: "{% if junos.ansible_facts.junos.model == 'VSRX' %}VSRX{% else %}SRX{% endif %}"
    
    # query facts to establish which re_name the connection has been made to
    - name: Identify node
      set_fact: 
        node_name: "{{ junos.ansible_facts.junos.re_name }}"
    
    # a boolean true|false if this is the primary node in the cluster
    - name: Register primary node
      set_fact:
        primary_node: "{% if \ 
        (junos.ansible_facts.junos.srx_cluster_redundancy_group['0'].node0.status == 'primary') \ 
        and (node_name == 'node0') %}True{% elif \ 
        (junos.ansible_facts.junos.srx_cluster_redundancy_group['0'].node1.status == 'primary') \ 
        and (node_name == 'node1') %}True{% else %}False{% endif %}"
    
    # name of this node
    - name: Identify this node
      set_fact:
        this_node: "{{ junos.ansible_facts.junos.current_re[0] }}"
    
    # name of the other node
    - name: Identify other node
      set_fact:
        other_node: "{% if this_node == 'node0'%}node1{% else %}node0{% endif %}"
    
    # assert that this is the primary node
    - name: Verify this is the primary node
      assert:
        that:
          - primary_node | bool
        fail_msg: "Fail: This is the secondary node!"
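
The same node logic can be expressed in Python, which may be easier to test than nested Jinja conditionals. This is a sketch, not the project's code: `identify_nodes` is a hypothetical name, and `sample_facts` is a made-up dictionary that only contains the keys the tasks above read (`re_name`, `current_re`, `srx_cluster_redundancy_group`).

```python
def identify_nodes(facts):
    """Derive this_node, other_node and primary_node from Junos facts."""
    node_name = facts["re_name"]
    this_node = facts["current_re"][0]
    other_node = "node1" if this_node == "node0" else "node0"
    rg0 = facts["srx_cluster_redundancy_group"]["0"]
    # True only if the node we are connected to is primary for RG0.
    primary_node = (
        node_name in ("node0", "node1")
        and rg0.get(node_name, {}).get("status") == "primary"
    )
    return {
        "this_node": this_node,
        "other_node": other_node,
        "primary_node": primary_node,
    }


# Hypothetical facts for a session that landed on node0 while it is primary.
sample_facts = {
    "re_name": "node0",
    "current_re": ["node0"],
    "srx_cluster_redundancy_group": {
        "0": {
            "node0": {"status": "primary"},
            "node1": {"status": "secondary"},
        }
    },
}
print(identify_nodes(sample_facts))
```

When connecting through the master-only address, `primary_node` should always come out true, which is what the final assert task in the playbook checks.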
    

    I remind you again that you cannot just force through an upgrade of an SRX cluster.  Each node must NOT see the other cluster member running a different version of code.  If that happens, usually the cluster just doesn't work, and you then have to manually jump in, disconnect the nodes from each other, reboot them again and upgrade them individually before you can finally reboot them once more and let them form a cluster again.
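
A simple guard for that rule is to compare the versions of both nodes before any step that lets them rejoin. A minimal sketch, assuming the caller has already collected a version string per node by whatever means is available; `cluster_versions_consistent` is a hypothetical helper.

```python
def cluster_versions_consistent(versions):
    """Return True when node0 and node1 report the same Junos version.

    `versions` maps node name to version string, for example
    {"node0": "21.2R1.10", "node1": "21.2R1.10"}.
    """
    missing = {"node0", "node1"} - set(versions)
    if missing:
        raise ValueError("missing version for: %s" % ", ".join(sorted(missing)))
    return versions["node0"] == versions["node1"]


print(cluster_versions_consistent({"node0": "21.2R1.10", "node1": "21.2R1.10"}))
```

A playbook could fail early on a False result instead of discovering the mismatch after both nodes have rebooted.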

    I don't think that I can share the whole playbooks here; as I mentioned, this was something developed as part of a customer project.

    Ideally, it would be better to understand what upgrade process you are trying to perform, on what type of SRX since the different models can have different approaches, and also the s/w versions that are involved in the upgrade etc.   Then we can try to tailor the solution and approach to fit your needs, rather than just a particular project that I worked on that imho wasn't the right way to go about it, but we had no choice.



    ------------------------------
    Andy Sharp
    ------------------------------



  • 5.  RE: SRX Upgrade HA Chassis

    Posted 29 days ago

    I appreciate the detailed post.

    Actually, I was still figuring out whether I could even run the Ansible playbook against the NETCONF IP itself.

    Because of the gateway OS on site, I couldn't run the upgrade script from it: it has an outdated PHP that I cannot upgrade. I was, however, able to run a playbook that does a basic NETCONF check, which passed, so I assume that part is actually doable. So now I'm trying to figure out why I can't run playbooks over the jump host for my SRX.

    I am not too worried about downtime. The way I usually do it is to reboot the secondary and then the primary to make sure there are no HW issues, then run the upgrade on the secondary and then the primary, and then reboot the cluster. I have a lab I can test on before that as well.

    As for transferring the files between the SRXs, it seemed to me that the SCP module got deprecated for scp from or to an SRX, so they're now recommending ansible.netcommon.net_get/net_put, so that may be my way in. I had issues with that too, so I'll need to do some more research on it.

    So my current dilemma seems to be my jump host and NETCONF. Once that works I can most likely run the regular upgrade module; I just need to see if I can change the upgrade portion so that it performs the manual process on the cluster.
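
One common way to reach NETCONF behind a jump host is an OpenSSH `ProxyCommand` entry in an SSH config file. PyEZ's `Device` accepts an `ssh_config` path, and as far as I know the Juniper Ansible modules expose a matching `ssh_config` option, but whether that resolves this particular setup is an assumption. The sketch below only renders such a stanza; the function name, host names and user are hypothetical.

```python
def render_jump_host_ssh_config(target, jump_host, jump_user):
    """Render an ssh_config stanza that tunnels `target` via `jump_host`."""
    lines = [
        "Host {0}".format(target),
        # %h and %p are expanded by ssh to the target host and port.
        "    ProxyCommand ssh -W %h:%p {0}@{1}".format(jump_user, jump_host),
    ]
    return "\n".join(lines) + "\n"


print(render_jump_host_ssh_config(
    "srx-cluster.example.net", "bastion.example.net", "admin"
))
```

The rendered text would go into a file that the playbook or script then points to through its SSH config setting.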