@skb Steve, thank you for your message! I was able to track down my storage overuse to a random firewall log file in /var/log, but I was also interested in what you wrote, because we have in fact recently upgraded our EX3400s from 18.4 to 20.2, and they could use a new recovery snapshot.
I took a look at several virtual chassis ranging in size from 2 to 7 switches. They all have storage utilization of 74%-76% in /dev/gpt/junos. They also all have 1 recovery snapshot (with the old OS) and 1-2 non-recovery snapshots (with the old and/or new OS).
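In case it helps anyone reading along, these are the standard CLI commands I used to check this (operational mode; both report on all members of a virtual chassis):
--------------------------------------------------------------------------
show system storage     <- per-member utilization, including /dev/gpt/junos
show system snapshot    <- recovery and non-recovery snapshots per member
--------------------------------------------------------------------------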
The not-so-happy part is that when I run 'request system snapshot recovery all-members' on these stacks, the command succeeds on some individual switches in a virtual chassis but fails on others with a message like this:
--------------------------------------------------------------------------
Creating image ...
Compressing image ...
mkuzip: write(/var/tmp/.snap.73193/recovery.ufs.uzip): No space left on device
------------------------------------------------------------------------------
After I run 'request system snapshot recovery', whether it succeeds or fails, it cleans up some storage, like you noticed. It appears to do that by deleting the non-recovery snapshots. Afterwards, the storage utilization indeed drops to 44%-47%. Rerunning 'request system snapshot recovery' after that yields the same result as before: it succeeds and fails on the same members.
Of the 10 virtual chassis I looked at, I found the following:
3-stack: succeeds on 0, fails on 1 and 2
3-stack: succeeds on 0, fails on 1 and 2
single: succeeds
single: succeeds
single: succeeds
3-stack: succeeds on 0, fails on 1 and 2
7-stack: succeeds on 0 and 2-5, fails on 1 and 6
4-stack: succeeds on 0 and 3, fails on 1 and 2
2-stack: succeeds on 0, fails on 1
3-stack: succeeds on 0, fails on 1 and 2
In this sample, in the case of virtual chassis, it always fails on the master, in all but one case it also fails on the backup, and sometimes it fails on some linecard(s).
With the storage utilization on all switches being within such a close range (44%-47% or so), it seemed unlikely that the difference between success and failure was really just a few MB, or that having 53%+ of storage available would be insufficient to create a snapshot.
I looked online and tried the following:
- I tried running 'request system storage cleanup' on each switch, and that made no difference.
- I also checked the /packages/sets/active/optional/ folder, and items there only point to packages from the current OS, not the old one.
- I checked /packages/db/ for any left-over packages from the old OS version, and there were none.
- I rebooted the entire chassis, and that made no difference.
- I did not think the issue was a storage shortage on the OAM volume where recovery snapshots are stored, as the error does not mention OAM (and I know some errors do mention OAM specifically). However, just to be sure, I mounted the oam volume and deleted recovery.ufs.uzip and VERSION from /oam/snapshot/. Storage utilization of the oam volume (checked before I unmounted it) went from 90%-96% down to 20%, and 'show system snapshot' no longer shows any recovery snapshot for that switch, but I still cannot create a new one.
- I found one reference online to successfully fixing exactly the error I am getting (https://www.reddit.com/r/Juniper/comments/fmgv6j/update_failed_on_ex3400_virtual_chassis/). The suggested commands actually came from a thread here in the Community (https://community.juniper.net/communities/community-home/digestviewer/viewthread?MID=69685#bm18771141-b085-40c6-beb9-b07141f8654e): the shell commands 'pkg setop rm previous' and 'pkg delete old'. I suspect these are aimed at cleaning up packages left over from the old OS, which I had already checked I did not have, so, disappointingly, running them changed nothing for me.
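For completeness, here is roughly how those attempts looked condensed into one session. The oam device path is my assumption from our switches; double-check the device name on yours before mounting, and obviously be careful deleting anything under /oam:
--------------------------------------------------------------------------
request system storage cleanup        <- CLI; add 'member X' for one member
start shell user root
ls /packages/sets/active/optional/    <- look for links to old-OS packages
ls /packages/db/                      <- look for leftover old-OS packages
pkg setop rm previous                 <- from the Community thread above
pkg delete old
mount /dev/gpt/oam /oam               <- device name is my assumption
rm /oam/snapshot/recovery.ufs.uzip /oam/snapshot/VERSION
umount /oam
--------------------------------------------------------------------------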
I thought perhaps there was something intrinsically different between the switches that succeeded and those that failed, like a firmware version, but 'show chassis firmware detail' on the EX3400 does not provide a whole lot of info, and everything there was consistent anyway.
At one point I was logged into the same virtual chassis via two separate SSH sessions. In one I was running the 'request system snapshot recovery' command and waiting for results. The process eventually failed with the typical message below (note the snapshot number for later):
fpc1:
--------------------------------------------------------------------------
Creating image ...
Compressing image ...
mkuzip: write(/var/tmp/.snap.59694/recovery.ufs.uzip): No space left on device
--------------------------------------------------------------------------
In another session, I thought I would check again whether there was any more storage I could free up, and ran 'request system storage cleanup dry-run'. I happened to catch the recovery snapshot file compression stage in progress, I guess:
fpc1:
--------------------------------------------------------------------------
List of files to delete:
Size Date Name
268.2K Aug 2 15:45 /var/log/User-Auth.0.gz
153B Aug 2 15:45 /var/log/conflicts.0.gz
50.5K Aug 2 15:45 /var/log/default-log-messages.0.gz
52.9K Aug 2 15:45 /var/log/firewall.0.gz
107.5K Aug 2 15:45 /var/log/interactive-commands.0.gz
256.8K Aug 2 15:44 /var/log/messages.0.gz
159.0K Aug 2 15:45 /var/log/security.0.gz
27B Aug 2 15:45 /var/log/wtmp.0.gz
70.6K Aug 2 15:41 /var/tmp/.snap.59694/contents.mtree
452.6M Aug 2 15:44 /var/tmp/.snap.59694/recovery.ufs
330.4M Aug 2 15:45 /var/tmp/.snap.59694/recovery.ufs.uzip
0B Jun 27 14:44 /var/tmp/LOCK_FILE
11B Aug 2 09:55 /var/tmp/alarmd.ts
--------------------------------------------------------------------
It looks like, at one point during recovery snapshot creation, a total of roughly 783MB of storage in /var/tmp was taken up by snapshot-creation activities (the uncompressed recovery.ufs and the compressed recovery.ufs.uzip exist at the same time). That is more than I ever see available in /dev/gpt/junos with my storage utilization of 44%-47%:
fpc6:
--------------------------------------------------------------------------
Filesystem Size Used Avail Capacity Mounted on
/dev/gpt/junos 1.3G 560M 696M 45% /.mount
tmpfs 802M 48K 802M 0% /.mount/tmp
tmpfs 324M 324K 324M 0% /.mount/mfs
I had thought that perhaps some temporary storage outside the /dev/gpt/junos partition was used to house the recovery snapshot file after it is created and while it is being compressed. That appears to be incorrect. Taken together with the message board post I found, with an error identical to mine and a solution of clearing out old packages (also housed in /dev/gpt/junos), it increasingly looked like the issue, unlikely as it seems, was that I just did not have enough available storage in /dev/gpt/junos on some switches to create a recovery snapshot, and that in those cases I was short by 2MB-20MB.
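As a quick sanity check on the numbers, here is the arithmetic from the dry-run listing and the 'df' output above (sizes in MB; 696 is the Avail figure from fpc6):

```shell
# The uncompressed image and the compressed image coexist in /var/tmp,
# so peak temporary usage is roughly their sum.
peak=$(awk 'BEGIN { printf "%.1f", 452.6 + 330.4 }')  # recovery.ufs + recovery.ufs.uzip
avail=696                                             # MB available on /dev/gpt/junos at ~45% used
echo "peak ~${peak} MB vs ${avail} MB available"
```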
Looking at multiple stacks and multiple switches, it seemed that having 693MB available in /dev/gpt/junos was sufficient to create a new recovery snapshot, while 692MB was not (in my case anyway, given the specific OS version, etc.). I found a switch with 688MB available. I identified a package in /packages/mount/ which I was pretty sure we would never use in our environment; the file was 24MB. From the shell, I deleted it with 'pkg delete PACKAGE-NAME'. That gave me 8MB of additional storage (not sure how 24MB becomes 8MB, but whatever, I guess), for a total of 696MB. And then I was able to create a new recovery snapshot!
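Condensed, the sequence that finally worked for me (PACKAGE-NAME is a placeholder for whatever package you are certain your environment will never use; delete at your own risk):
--------------------------------------------------------------------------
show system storage                 <- confirm available space in /dev/gpt/junos
start shell user root
ls -la /packages/mount/             <- find a genuinely unused package
pkg delete PACKAGE-NAME             <- freed ~8MB for me despite the 24MB file size
exit
request system snapshot recovery    <- or add 'all-members' for the whole VC
--------------------------------------------------------------------------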
So it seems like a system needs almost 700MB free, or some 55% of the main partition, to create a recovery snapshot? I am no expert, but is that not impractical? Over the lifetime of a switch, the OS will likely go through several major versions. If, once a switch is in active use, there is never enough storage to create a recovery snapshot with a new OS, then if we ever have to restore from OAM, we would be restoring to some original recovery snapshot created years ago. Why can some other temporary storage, outside the main junos partition, not be used for recovery snapshot creation? Is there a way to change that?
Alternatively, can anyone think of some additional clean-up of some temporary or nonessential files I can perform that would free up the space I need to create a new recovery snapshot?
Thank you!
------------------------------
LYDIA
------------------------------
Original Message:
Sent: 07-31-2021 12:19
From: Steve Bohrer
Subject: EX storage full, and cleanup command does not help
I found on our EX3400s that installing a version update fills storage until I save a new recovery image with the updated version; doing that dropped the used storage from about 70% back to 37%.
"sh system snapshot" will show you if your recovery snapshot is the same version as you have loaded, and also would reveal any non-recovery snapshots
Or, probably no harm in just trying "req sys snap recovery" and let it save a new one.
------------------------------
Steve Bohrer
Original Message:
Sent: 07-30-2021 00:44
From: Anonymous User
Subject: EX storage full, and cleanup command does not help
This message was posted by a user wishing to remain anonymous
I work with a number of EX3400 virtual-chassis of various sizes. Occasionally, a commit command will fail on a member of the virtual-chassis with a message that storage on that member is full. Normally this is easily resolved by running a 'request system storage cleanup member x' command, or by logging into the member in question with 'request session member x' and manually deleting multiple crash* and ksync* files from /var/tmp/.
However, occasionally, when a commit fails, and indeed storage on one of the members is full, that member has no crash* or ksync* files in its /var/tmp/ and running the 'cleanup' command, which deletes a few measly files, makes no significant difference to the storage situation. Rebooting that member, which is NOT the master, makes no difference to the storage situation. Rebooting the entire virtual chassis does not make a difference either.
The member 0 is not the master, but it is the member that the virtual-chassis connects through to the rest of the network. The virtual-chassis is configured to commit synchronize automatically. I only have remote access to the device.
I tried to look through the file system on member 0 as best I could, and compare it to other non-master member, and could not see any obvious difference. However, I am fairly new to both Juniper and FreeBSD. Any ideas on how I can figure out what is full on member 0, free up storage, and be able to change and commit configuration again?
Thanks!