Junos OS


EX storage full, and cleanup command does not help

  • 1.  EX storage full, and cleanup command does not help

    This message was posted by a user wishing to remain anonymous
    Posted 07-30-2021 05:31

    I work with a number of EX3400 virtual-chassis of various sizes. Occasionally, a commit command will fail on a member of the virtual-chassis with a message that storage on that member is full. Normally this is easily resolved by running a 'request system storage cleanup member x' command, or by logging into the member in question with 'request session member x' and manually deleting multiple crash* and ksync* files from /var/tmp/.

    However, occasionally a commit fails because storage on one of the members is genuinely full, yet that member has no crash* or ksync* files in its /var/tmp/, and running the 'cleanup' command, which deletes only a few measly files, makes no significant difference to the storage situation. Rebooting that member, which is NOT the master, makes no difference, and neither does rebooting the entire virtual chassis. The commit fails with:
    member 0 storage full

    The member 0 is not the master, but it is the member that the virtual-chassis connects through to the rest of the network. The virtual-chassis is configured to commit synchronize automatically. I only have remote access to the device.

    I tried to look through the file system on member 0 as best I could and compare it to another non-master member, but could not see any obvious difference. However, I am fairly new to both Juniper and FreeBSD. Any ideas on how I can figure out what is full on member 0, free up storage, and be able to change and commit configuration again?

    Thanks!


  • 2.  RE: EX storage full, and cleanup command does not help

    Posted 08-02-2021 05:17
    Maybe login to the member0 and try some
    request system snapshot delete snap.*
    and
    request system snapshot delete previous
    ?

    Actually, you can log into the FreeBSD shell of member0 (once in the CLI of member0) and check at the root where all the stuff is:
    start shell
    cd /
    du -xhs /var
    du -xhs /packages/db
    exit

    and compare with the other members to check where the big difference is...
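    To make that member-to-member comparison quicker, a small helper along these lines can rank the biggest directories and files under a path. This is just a sketch for the FreeBSD/Linux shell reached via 'start shell'; the helper name, the /var path, and the 10-entry limit are assumptions to adapt.

```shell
# Hypothetical helper: rank the largest items (in kB) under a directory.
# -a lists files as well as directories, -x stays on one filesystem so
# other mounts don't skew the numbers, -k reports sizes in kB.
largest() {
    du -axk "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example: the 10 biggest entries under /var on this member
largest /var 10
```

    Running the same helper on a healthy member and comparing the two outputs should make the oversized directory or file stand out.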

    ------------------------------
    Olivier Benghozi
    ------------------------------



  • 3.  RE: EX storage full, and cleanup command does not help

    Posted 08-02-2021 05:17
    I found on our EX3400s that installing a version update fills storage until I save a new recovery image with the updated version; doing so dropped the used storage from about 70% back to 37%.

    "sh system snapshot" will show you if your recovery snapshot is the same version as you have loaded, and also would reveal any non-recovery snapshots

    Or, probably no harm in just trying "req sys snap recovery" and let it save a new one.

    ------------------------------
    Steve Bohrer
    ------------------------------



  • 4.  RE: EX storage full, and cleanup command does not help

    Posted 08-04-2021 14:32
    @Olivier Benghozi - Thank you! The disk utilization command worked! Comparing the problem switch in the stack with another non-master member, I tracked the storage over-usage to a firewall log file in /var/log/. Now, you would think the 'request system storage cleanup' command I ran before would have taken care of this file, but I noticed its name (firewall.0, for example) did not follow the standard naming convention I am used to seeing for current and archived log files (firewall, firewall.0.gz, firewall.1.gz, firewall.2.gz, etc.). So my guess is this one might be somehow outside the normal log creation and rotation process, and thus apparently could not be identified for cleanup by the storage cleanup command. (Although notice below that on the second virtual-chassis, while this old firewall log file might be 'outside' the normal log creation and rotation logic, when a new log file is created WITHIN the standard log archival process, the old huge file does get renamed.)

    I had two 3-member virtual chassis with the same issue. On one I saw:
    -rw-rw---- 1 root wheel 40902 Aug 2 09:11 firewall
    -rw-rw---- 1 root wheel 462979041 Jun 27 14:41 firewall.2

    On the other I saw:
    -rw-rw---- 1 root wheel 36792 Aug 2 09:18 firewall
    -rw-rw---- 1 root wheel 468090913 Jun 20 18:39 firewall.0
    And on the same switch, later in the day:
    -rw-rw---- 1 root wheel 3722373 Aug 2 16:41 firewall
    -rw-rw---- 1 root wheel 121759 Aug 2 13:45 firewall.0.gz
    -rw-rw---- 1 root wheel 468090913 Jun 20 18:39 firewall.1

    After I deleted the nearly 0.5 GB files from the member 0 switches in both virtual-chassis, storage utilization on member 0 went down to match the other switches in the virtual chassis, and I was able to commit configuration.

    I would love to know how and why this huge and old firewall log file was created and never deleted, in part to make sure this does not happen again, but I do not think I will ever find out. My problem of not being able to commit is solved, however. Thanks again!
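    For anyone hitting the same thing, one way to spot such oversized logs directly, rather than relying on the cleanup command, is a size-threshold sweep. A sketch, assuming shell access on the member; the /var/log path, the helper name, and the ~100 MB threshold are all assumptions to adjust.

```shell
# Hypothetical check: list regular files above a size threshold (in kB)
# under a directory, without crossing onto other filesystems (-xdev).
big_files() {
    find "$1" -xdev -type f -size "+${2}k" -exec ls -lh {} \; 2>/dev/null
}

# Example: anything over roughly 100 MB lurking in /var/log
big_files /var/log 100000
```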

    ------------------------------
    LYDIA
    ------------------------------



  • 5.  RE: EX storage full, and cleanup command does not help

    Posted 08-04-2021 18:55
    @skb Steve, thank you for your message! I was able to track down my storage overuse to a random firewall log file in /var/log, but I was interested in what you wrote as well because, in fact, we have recently upgraded our EX3400s from 18.4 to 20.2, and they could use a new recovery snapshot.

    I took a look at several virtual-chassis ranging in size from 2 to 7 switches. They all have storage utilization of 74%-76% in /dev/gpt/junos. They all also have 1 recovery (with old OS) snapshot and 1-2 non-recovery snapshots (with old and/or new OS).

    The not-so-happy thing was that when I ran 'request system snapshot recovery all-members' on these stacks, the command succeeded on some individual switches in a virtual chassis but failed on others with a message like this:
    --------------------------------------------------------------------------
    Creating image ...
    Compressing image ...
    mkuzip: write(/var/tmp/.snap.73193/recovery.ufs.uzip): No space left on device
    ------------------------------------------------------------------------------

    Running 'request system snapshot recovery', whether it succeeds or fails, cleans up some storage, as you noticed. It appears to do that by deleting the non-recovery snapshots. Afterwards, the storage utilization indeed drops to 44%-47%. Rerunning 'request system snapshot recovery' after that yields the same result as before - it succeeds and fails on the same members.

    Of the 10 virtual chassis I looked at, I found the following:
    3-stack: succeeds on 0, fails on 1 and 2
    3-stack: succeeds on 0, fails on 1 and 2
    single: succeeds
    single: succeeds
    single: succeeds
    3-stack: succeeds on 0, fails on 1 and 2
    7-stack: succeeds on 0 and 2-5, fails on 1 and 6
    4-stack: succeeds on 0 and 3, fails on 1 and 2
    2-stack: succeeds on 0, fails on 1
    3-stack: succeeds on 0, fails on 1 and 2
    In this sample, in the case of the virtual-chassis, it always fails on the master, in all but one case fails on the backup, and sometimes fails on some linecard(s).

    With the storage utilization on all switches being in such a close range (44%-47% or so), it seemed unlikely that the difference between success and failure was really just a few MB, and that having 53% of storage free was insufficient for snapshot creation.

    I looked online and tried the following:
    - I tried running 'request system storage cleanup' on each switch, and that made no difference.
    - I also checked the /packages/sets/active/optional/ folder, and items there only point to packages from the current OS, not the old one.
    - I checked /packages/db/ for any left-over packages from old OS version, and there were not any
    - I rebooted entire chassis and that made no difference.
    - I did not think the issue was a storage shortage on the OAM volume where recovery snapshots are stored, as the error does not mention OAM (and I know some errors do mention OAM specifically). However, just to be sure, I mounted the oam volume and deleted recovery.ufs.uzip and VERSION from /oam/snapshot/. Storage utilization of the oam volume (before I unmounted it) went from 90%-96% down to 20%, and 'show system snapshot' no longer shows any recovery snapshot for that switch, but I still cannot create a new one.
    - I found one reference online to successfully fixing exactly the error I am getting (https://www.reddit.com/r/Juniper/comments/fmgv6j/update_failed_on_ex3400_virtual_chassis/) and executed the suggested commands, which actually came from a thread here in the Community (https://community.juniper.net/communities/community-home/digestviewer/viewthread?MID=69685#bm18771141-b085-40c6-beb9-b07141f8654e): the shell commands 'pkg setop rm previous' and 'pkg delete old'. I suspect these are aimed at cleaning up packages left over from the old OS, which I had already checked I did not have, so, disappointingly, running these commands changed nothing for me.

    I thought perhaps there was something intrinsically different about switches which succeeded and those that failed, like a firmware version, but 'show chassis firmware detail' on EX3400 does not provide a whole lot of info and everything there was consistent anyway.

    At one point I was logged into the same virtual chassis via two separate SSH sessions. In one I was running the 'request system snapshot recovery' command and waiting for results. The process eventually failed with the typical message below (note the snapshot number for later):
    fpc1:
    --------------------------------------------------------------------------
    Creating image ...
    Compressing image ...
    mkuzip: write(/var/tmp/.snap.59694/recovery.ufs.uzip): No space left on device
    --------------------------------------------------------------------------
    In another session, I thought I would check again whether there was any more storage I could free up, and ran 'request system storage cleanup dry-run'. I happened to catch, I guess, the recovery snapshot file compression stage in progress:
    fpc1:
    --------------------------------------------------------------------------
    List of files to delete:
    Size Date Name
    268.2K Aug 2 15:45 /var/log/User-Auth.0.gz
    153B Aug 2 15:45 /var/log/conflicts.0.gz
    50.5K Aug 2 15:45 /var/log/default-log-messages.0.gz
    52.9K Aug 2 15:45 /var/log/firewall.0.gz
    107.5K Aug 2 15:45 /var/log/interactive-commands.0.gz
    256.8K Aug 2 15:44 /var/log/messages.0.gz
    159.0K Aug 2 15:45 /var/log/security.0.gz
    27B Aug 2 15:45 /var/log/wtmp.0.gz
    70.6K Aug 2 15:41 /var/tmp/.snap.59694/contents.mtree
    452.6M Aug 2 15:44 /var/tmp/.snap.59694/recovery.ufs
    330.4M Aug 2 15:45 /var/tmp/.snap.59694/recovery.ufs.uzip
    0B Jun 27 14:44 /var/tmp/LOCK_FILE
    11B Aug 2 09:55 /var/tmp/alarmd.ts
    --------------------------------------------------------------------
    It looked like, at one point during recovery snapshot creation, a total of about 783MB of storage in /var/tmp (452.6M for recovery.ufs plus 330.4M for recovery.ufs.uzip) was taken up with snapshot creation activities. That is more than I ever see available in /dev/gpt/junos with my storage utilization of 44%-47%:

    fpc6:
    --------------------------------------------------------------------------
    Filesystem Size Used Avail Capacity Mounted on
    /dev/gpt/junos 1.3G 560M 696M 45% /.mount
    tmpfs 802M 48K 802M 0% /.mount/tmp
    tmpfs 324M 324K 324M 0% /.mount/mfs

    I had thought that perhaps some temporary storage outside the /dev/gpt/junos partition was used to house the recovery snapshot file after it is created and while it is being compressed. That appears to be incorrect. Taken together with the message board post I found (with an error identical to mine and a solution of clearing old packages, which are also housed in /dev/gpt/junos), it increasingly looked like the issue, unlikely as it seems, was that I just did not have enough available storage in /dev/gpt/junos on some switches to create a recovery snapshot, and that in those cases I was short 2MB-20MB.

    Looking at multiple stacks and multiple switches, it seemed that having 693MB available in /dev/gpt/junos was sufficient to create a new recovery snapshot, while 692MB was not (in my case anyway, given the specific OS version, etc.). I found a switch with 688MB available. I identified a package in /packages/mount/ which I was pretty sure we would never use in our environment. The file was 24MB. From the shell, using 'pkg delete PACKAGE-NAME', I deleted it. That gave me 8MB of additional storage (not sure how 24MB becomes 8MB, but, whatever, I guess), for a total of 696MB. And then I was able to create a new recovery snapshot!

    So, it seems like a system needs almost 700MB, or some 55% of available storage in the main partition, to create a recovery snapshot? I am no expert, but is that not impractical? Over the lifetime of a switch, the OS will likely go through several major versions. If, once the switch is in active use, there is never enough storage to create a recovery snapshot with a new OS, then if we ever have to restore from OAM, we would be restoring to the original OS recovery snapshot created years ago. Why can't some other temporary storage, outside the main junos partition, be used for recovery snapshot creation? Is there a way to change that?
    Alternatively, can anyone think of some additional clean-up of some temporary or nonessential files I can perform that would free up the space I need to create a new recovery snapshot?
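    Based purely on the numbers observed in this thread (not on anything official), a pre-flight check before attempting the snapshot might look like the sketch below. The ~693 MB threshold is an inference from these experiments, the helper name is made up, and the mount point would need adapting to the switch.

```shell
# Hypothetical pre-flight check before 'request system snapshot recovery'.
# NEED_KB is inferred from observations in this thread (~693 MB free in
# /dev/gpt/junos seemed to be the success threshold); it is NOT documented.
NEED_KB=709632

check_space() {
    # $1 = available kB, $2 = required kB
    if [ "$1" -ge "$2" ]; then
        echo "likely enough space"
    else
        echo "likely to fail (short by $(( $2 - $1 )) kB)"
    fi
}

# Example: check the root filesystem of the current member
avail_kb=$(df -k / | awk 'NR==2 {print $4}')
check_space "$avail_kb" "$NEED_KB"
```

    Run per member before kicking off 'all-members', so only the members that would fail anyway need further cleanup.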

    Thank you!

    ------------------------------
    LYDIA
    ------------------------------



  • 6.  RE: EX storage full, and cleanup command does not help

    Posted 08-04-2021 20:18
    Actually, I think that it does at least something like:
    request system snapshot delete snap.*
    request system snapshot delete previous

    which delete all sets (except the running one, of course), then delete older package files once they are no longer members of a set (like a non-recovery snapshot snap.*, or the «previous» hidden one that contains the previous version – such snapshots don't exist on EX2300 after an install since they don't have enough flash space anyway).

    «show system snapshot» doesn't show the «previous» non-recovery snapshot, actually.

    You might try and check whether the free space after those two commands is the same as after a 'request system snapshot recovery'.

    ------------------------------
    Olivier Benghozi
    ------------------------------



  • 7.  RE: EX storage full, and cleanup command does not help

    Posted 08-04-2021 15:16
    Maybe a traceoption directed to a file named «firewall», either now or in the past? (It would probably need to be deactivated.)

    show configuration | display set | match traceoptions

    ------------------------------
    Olivier Benghozi
    ------------------------------



  • 8.  RE: EX storage full, and cleanup command does not help

    Posted 08-04-2021 19:39
    @Olivier Benghozi I checked the master and member 0 (just in case configs were not always synchronized) and neither currently has traceoptions configured. Configuring traceoptions or removing them would take a commit action, right? I noticed the firewall.X files were date-stamped June 27 or June 20 (assuming this year, as the switches were not yet online last year), and the configuration on the master and member 0 has not changed since April. Actually, I searched through all rollbacks going back to the moment we unpacked this switch from its box, and I see no traceoptions configured for the firewall...

    ------------------------------
    LYDIA
    ------------------------------



  • 9.  RE: EX storage full, and cleanup command does not help

    Posted 08-04-2021 20:18
    One place that I've found remnants in is /.mount/root/ - tons of files named "...transferring file... ####" - I'm fairly certain these are remnants from our ZTP process so they may not be present in your switches, but just another place to look. I've definitely had the same issues as you so I've been scouring for any places to free up space when I can for upgrades / recovery snapshots. 9/10 times I'm fine, but there are always a couple switches that have issues.
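    A quick way to sweep for those remnants before deleting anything is a name-pattern scan. A sketch only: the /.mount/root path and the "transferring" pattern come from the observation above and may differ per setup, and the helper name is made up.

```shell
# Hypothetical sweep for leftover transfer files; prints matches with
# sizes so they can be reviewed before anything is removed by hand.
remnants() {
    find "$1" -maxdepth 1 -type f -name "*${2}*" -exec ls -lh {} \; 2>/dev/null
}

# Example: look for ZTP transfer leftovers in /.mount/root/
remnants /.mount/root "transferring"
```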

    My latest fix has been to format install on 15.1X53-D591.1 and then to upgrade to our current standardized version of 19.4R3.11. I've found that if I format install on 19.4R3.11, there isn't enough space to do a recovery snapshot... so strange.



  • 10.  RE: EX storage full, and cleanup command does not help

    Posted 08-04-2021 20:18
    Sure, traceoptions needs a commit.
    Actually, could it be a syslog action in a firewall filter?
    show configuration firewall | display set | match syslog

    ------------------------------
    Olivier Benghozi
    ------------------------------