RED HAT ENTERPRISE LINUX

Archiving and Compression

Creating Compressed Archives for Backup and Transfer

CIS238RH | RHEL System Administration 2
Mesa Community College

Learning Objectives

1
Create and extract tar archives

Bundle files preserving permissions, ownership, and structure

2
Apply compression with gzip, bzip2, and xz

Reduce file sizes using different compression algorithms

3
Create compressed archives efficiently

Combine archiving and compression in single operations

4
Transfer archives between systems

Use scp, rsync, and other methods for file transfer

Archiving vs Compression

Archiving combines multiple files into one. Compression reduces file size. These are separate operations that are often combined.

Archiving (tar)

Bundles files into single archive. Preserves permissions, ownership, timestamps, directory structure.

Compression (gzip/bzip2/xz)

Reduces file size using algorithms. Trades CPU time for smaller output.

Multiple Files → tar (archive) → .tar file → gzip (compress) → .tar.gz file
Why separate? Unix philosophy: each tool does one thing well. tar archives, gzip compresses. Combined: powerful and flexible.
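The two-step pipeline above can be sketched end to end. This is a minimal demo using a throwaway directory under mktemp; the filenames are illustrative:

```shell
# Archiving and compression as two separate steps, then combined.
set -e
work=$(mktemp -d)
mkdir -p "$work/docs"
echo "report" > "$work/docs/report.txt"
echo "notes"  > "$work/docs/notes.txt"
cd "$work"

# Step 1: tar bundles the files (no size reduction yet)
tar -cf docs.tar docs/
# Step 2: gzip compresses the bundle, producing docs.tar.gz
gzip docs.tar

# Combined: tar drives gzip itself via -z, one command
tar -czf docs2.tar.gz docs/

ls docs.tar.gz docs2.tar.gz
```

Both paths end in the same kind of file; the combined form is simply more convenient.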

The tar Command

tar (tape archive) is the standard Unix/Linux tool for creating archives. It preserves file metadata and directory structure.

c
Create archive
x
Extract archive
t
List contents
v
Verbose output
f
Filename
# Basic tar syntax
[student@server ~]$ tar [operation] [options] -f archive.tar [files...]

# The -f option MUST be followed by the filename
# Operations: c (create), x (extract), t (list) - pick ONE
# Options: v (verbose), z/j/J (compression), p (preserve permissions)
Remember: -f must be immediately followed by the archive filename. Use tar -cvf archive.tar, not tar -cfv archive.tar (in the second form, tar would treat v as the filename).

Creating Archives

# Create archive of a directory
[student@server ~]$ tar -cvf backup.tar /home/student/documents/
documents/
documents/report.txt
documents/data/
documents/data/file1.csv
documents/data/file2.csv

# Create archive of multiple items
[student@server ~]$ tar -cvf project.tar file1.txt file2.txt mydir/

# Create archive with absolute paths removed (default behavior)
[student@server ~]$ tar -cvf backup.tar /etc/hosts /etc/hostname
tar: Removing leading '/' from member names
etc/hosts
etc/hostname

# Preserve absolute paths (use with caution!)
[student@server ~]$ tar -cvPf backup.tar /etc/hosts

# Verify archive size
[student@server ~]$ ls -lh backup.tar
-rw-r--r--. 1 student student 15M Jan 20 14:00 backup.tar
Leading / removed: By default, tar strips the leading / from paths. This is a safety feature - extraction won't accidentally overwrite system files.
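The leading-slash stripping can be observed directly. A small sketch, using a temp file rather than a real system path:

```shell
# tar strips the leading / from absolute member names by default.
set -e
work=$(mktemp -d)
echo "hello" > "$work/hosts.copy"

# Archive a file by absolute path; the warning goes to stderr
tar -cf "$work/abs.tar" "$work/hosts.copy" 2>/dev/null

# The stored member name has no leading slash
first=$(tar -tf "$work/abs.tar" | head -n 1)
echo "$first"
```

Because the stored name is relative, extracting this archive writes under the current directory instead of clobbering the real path.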

Listing and Extracting

# List archive contents without extracting
[student@server ~]$ tar -tvf backup.tar
drwxr-xr-x student/student   0 2024-01-20 14:00 documents/
-rw-r--r-- student/student 1024 2024-01-20 13:55 documents/report.txt
drwxr-xr-x student/student    0 2024-01-20 14:00 documents/data/
-rw-r--r-- student/student 2048 2024-01-20 13:50 documents/data/file1.csv

# Extract entire archive to current directory
[student@server ~]$ tar -xvf backup.tar

# Extract to a specific directory
[student@server ~]$ tar -xvf backup.tar -C /tmp/restore/

# Extract specific files only
[student@server ~]$ tar -xvf backup.tar documents/report.txt

# Extract files matching a pattern
[student@server ~]$ tar -xvf backup.tar --wildcards "*.csv"
Best Practice: Always use tar -tvf to inspect an archive before extracting. Know what you're about to unpack!
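The list-then-extract habit looks like this in practice. A minimal sketch with scratch directories standing in for real data:

```shell
# Inspect with -t first, then extract into a chosen directory with -C.
set -e
work=$(mktemp -d)
mkdir -p "$work/src" "$work/restore"
echo "data" > "$work/src/file.txt"
tar -cf "$work/a.tar" -C "$work" src/

# List first...
tar -tf "$work/a.tar"
# ...then extract into a directory you chose, not wherever you happen to be
tar -xf "$work/a.tar" -C "$work/restore"
cat "$work/restore/src/file.txt"
```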

Compression Tools

gzip
Extension: .gz
Speed: Fast
Ratio: Good
Use: General purpose
bzip2
Extension: .bz2
Speed: Medium
Ratio: Better
Use: Better compression
xz
Extension: .xz
Speed: Slow
Ratio: Best
Use: Maximum compression
zip
Extension: .zip
Speed: Fast
Ratio: Good
Use: Cross-platform
# Compression comparison (100MB text file)
Original:     100 MB   (baseline)
gzip:          25 MB   ~2 seconds    (75% reduction)
bzip2:         20 MB   ~8 seconds    (80% reduction)  
xz:            15 MB   ~30 seconds   (85% reduction)
Choose wisely: gzip for speed, xz for size, bzip2 for balance. Results vary by data type - text compresses well, already-compressed files (images, videos) don't.
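The ratio differences are easy to measure yourself. A rough sketch on generated repetitive text; exact numbers depend on your data and CPU, and the bzip2/xz steps are skipped if those tools are not installed:

```shell
# Compare compressed sizes on a repetitive text sample.
set -e
work=$(mktemp -d)
# Highly repetitive text compresses extremely well
yes "the quick brown fox jumps over the lazy dog" | head -n 20000 > "$work/sample.txt"
orig=$(wc -c < "$work/sample.txt")

gzip -k "$work/sample.txt"
gz=$(wc -c < "$work/sample.txt.gz")
echo "original: $orig bytes, gzip: $gz bytes"

if command -v bzip2 >/dev/null; then
    bzip2 -k "$work/sample.txt"
    echo "bzip2: $(wc -c < "$work/sample.txt.bz2") bytes"
fi
if command -v xz >/dev/null; then
    xz -k "$work/sample.txt"
    echo "xz: $(wc -c < "$work/sample.txt.xz") bytes"
fi
```

Try the same script on a JPEG or MP4 and the "compressed" file will barely shrink, or even grow.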

Using gzip

# Compress a file (replaces original with .gz)
[student@server ~]$ gzip largefile.txt
[student@server ~]$ ls
largefile.txt.gz

# Decompress (replaces .gz with original)
[student@server ~]$ gunzip largefile.txt.gz
# Or: gzip -d largefile.txt.gz

# Keep original file while compressing
[student@server ~]$ gzip -k largefile.txt
largefile.txt  largefile.txt.gz

# Compress to stdout (useful for pipes)
[student@server ~]$ gzip -c largefile.txt > largefile.txt.gz

# View compressed file without decompressing
[student@server ~]$ zcat largefile.txt.gz | head
[student@server ~]$ zless largefile.txt.gz

# Set compression level (1=fast, 9=best compression)
[student@server ~]$ gzip -9 largefile.txt     # Maximum compression
[student@server ~]$ gzip -1 largefile.txt     # Fastest
Note: gzip replaces the original file by default! Use -k to keep the original, or -c to output to stdout.
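The replace-by-default behavior is worth seeing once before it surprises you. A minimal sketch in a temp directory:

```shell
# By default gzip replaces the file; -k keeps the original.
set -e
work=$(mktemp -d)
echo "some text" > "$work/a.txt"
echo "some text" > "$work/b.txt"

gzip "$work/a.txt"        # a.txt is gone, only a.txt.gz remains
gzip -k "$work/b.txt"     # b.txt survives alongside b.txt.gz

ls "$work"
```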

Using bzip2 and xz

# bzip2 - better compression, slower
[student@server ~]$ bzip2 largefile.txt           # Creates .bz2
[student@server ~]$ bunzip2 largefile.txt.bz2     # Decompress
[student@server ~]$ bzip2 -k largefile.txt        # Keep original
[student@server ~]$ bzcat largefile.txt.bz2       # View without decompress

# xz - best compression, slowest
[student@server ~]$ xz largefile.txt              # Creates .xz
[student@server ~]$ unxz largefile.txt.xz         # Decompress
[student@server ~]$ xz -k largefile.txt           # Keep original
[student@server ~]$ xzcat largefile.txt.xz        # View without decompress

# xz compression levels (0-9, default 6)
[student@server ~]$ xz -9 largefile.txt           # Maximum (very slow)
[student@server ~]$ xz -0 largefile.txt           # Fastest

# xz with threads for faster compression
[student@server ~]$ xz -T 4 largefile.txt         # Use 4 CPU threads
[student@server ~]$ xz -T 0 largefile.txt         # Use all available CPUs
Performance tip: xz supports multi-threading with -T. Use -T 0 to automatically use all CPU cores for faster compression.

Compressed tar Archives

# Create gzip-compressed archive
[student@server ~]$ tar -czvf backup.tar.gz /home/student/documents/
# -z tells tar to use gzip compression

# Create bzip2-compressed archive
[student@server ~]$ tar -cjvf backup.tar.bz2 /home/student/documents/
# -j tells tar to use bzip2 compression

# Create xz-compressed archive
[student@server ~]$ tar -cJvf backup.tar.xz /home/student/documents/
# -J (capital J) tells tar to use xz compression

# Extract compressed archives (tar auto-detects compression)
[student@server ~]$ tar -xvf backup.tar.gz       # Works!
[student@server ~]$ tar -xzvf backup.tar.gz      # Explicit gzip
[student@server ~]$ tar -xjvf backup.tar.bz2     # Explicit bzip2
[student@server ~]$ tar -xJvf backup.tar.xz      # Explicit xz
Option   Compression   Extension
-z       gzip          .tar.gz or .tgz
-j       bzip2         .tar.bz2 or .tbz2
-J       xz            .tar.xz or .txz
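The auto-detection on extraction can be verified directly: create with an explicit -z, extract with plain -xf. A minimal sketch (assumes GNU tar, the default on RHEL):

```shell
# Modern GNU tar detects the compression format when reading.
set -e
work=$(mktemp -d)
mkdir -p "$work/data"
echo "payload" > "$work/data/file.txt"

tar -czf "$work/data.tar.gz" -C "$work" data/   # create with explicit gzip
rm -rf "$work/data"

tar -xf "$work/data.tar.gz" -C "$work"          # no -z needed: auto-detected
cat "$work/data/file.txt"
```

The compression flag still matters on create, where tar has no existing file to sniff.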

Common tar Operations

# Preserve permissions (important for system backups)
[root@server ~]# tar -cvpzf backup.tar.gz /etc/
# tar always records permissions when creating; -p restores them
# at extraction time (the default when extracting as root)

# Exclude files/directories
[student@server ~]$ tar -czvf backup.tar.gz --exclude='*.log' /home/student/
[student@server ~]$ tar -czvf backup.tar.gz --exclude='cache' --exclude='tmp' /var/www/

# Exclude from file
[student@server ~]$ cat exclude.txt
*.tmp
*.log
cache/
.git/
[student@server ~]$ tar -czvf backup.tar.gz -X exclude.txt /home/student/

# Update archive with newer files only
[student@server ~]$ tar -uvf backup.tar /home/student/documents/

# Append files to existing archive (uncompressed only)
[student@server ~]$ tar -rvf backup.tar newfile.txt
Note: Update (-u) and append (-r) only work with uncompressed archives. Compressed archives must be recreated entirely.
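The append restriction is easy to demonstrate: -r works on a plain .tar, while a compressed archive has to be decompressed (or rebuilt) first. A minimal sketch:

```shell
# -r appends to an uncompressed archive only.
set -e
work=$(mktemp -d)
echo "one" > "$work/one.txt"
echo "two" > "$work/two.txt"
cd "$work"

tar -cf backup.tar one.txt       # start with one member
tar -rf backup.tar two.txt       # append a second
tar -tf backup.tar               # both are listed

# For a .tar.gz: decompress, append, recompress
echo "three" > three.txt
gzip backup.tar                  # now compressed; -r would fail
gzip -d backup.tar.gz            # decompress...
tar -rf backup.tar three.txt     # ...then appending works again
```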

File Extensions Reference

Extension           Type                   Create                  Extract
.tar                Uncompressed archive   tar -cvf                tar -xvf
.tar.gz / .tgz      Gzip compressed        tar -czvf               tar -xzvf
.tar.bz2 / .tbz2    Bzip2 compressed       tar -cjvf               tar -xjvf
.tar.xz / .txz      XZ compressed          tar -cJvf               tar -xJvf
.gz                 Gzip single file       gzip file               gunzip file.gz
.bz2                Bzip2 single file      bzip2 file              bunzip2 file.bz2
.xz                 XZ single file         xz file                 unxz file.xz
.zip                Zip archive            zip -r arch.zip dir/    unzip arch.zip
Convention matters: Extensions tell users what type of file it is and how to handle it. Always use appropriate extensions.

The zip Command

zip creates archives compatible with Windows and other operating systems. It combines archiving and compression in one format.

# Create zip archive of files
[student@server ~]$ zip archive.zip file1.txt file2.txt file3.txt

# Create zip archive of directory (recursive)
[student@server ~]$ zip -r project.zip project/
  adding: project/ (stored 0%)
  adding: project/README.md (deflated 45%)
  adding: project/src/ (stored 0%)
  adding: project/src/main.c (deflated 62%)

# Extract zip archive
[student@server ~]$ unzip project.zip

# Extract to specific directory
[student@server ~]$ unzip project.zip -d /tmp/extract/

# List contents without extracting
[student@server ~]$ unzip -l project.zip

# Add password protection
[student@server ~]$ zip -e -r secure.zip sensitive/
Enter password: 
Verify password:
When to use zip: Sharing with Windows users, email attachments, when recipients expect .zip format. For Linux-to-Linux, tar.gz is more common.

Transferring with scp

scp (secure copy) transfers files between systems over SSH. It encrypts data in transit.

# Copy local file to remote system
[student@local ~]$ scp backup.tar.gz student@server:/home/student/
backup.tar.gz                      100%  15MB  10.2MB/s   00:01

# Copy from remote to local
[student@local ~]$ scp student@server:/home/student/data.tar.gz ./

# Copy directory recursively
[student@local ~]$ scp -r student@server:/var/www/ ./backup/

# Use specific port
[student@local ~]$ scp -P 2222 backup.tar.gz student@server:/home/student/

# Copy between two remote systems
[student@local ~]$ scp student@server1:/data/file.tar.gz student@server2:/backup/

# Preserve timestamps and permissions
[student@local ~]$ scp -p backup.tar.gz student@server:/home/student/

# Verbose mode for debugging
[student@local ~]$ scp -v backup.tar.gz student@server:/home/student/

Transferring with rsync

rsync efficiently synchronizes files, transferring only the differences. Ideal for backups and mirroring.

# Basic rsync (archive mode preserves everything)
[student@local ~]$ rsync -av /home/student/documents/ student@server:/backup/docs/

# Sync with compression during transfer
[student@local ~]$ rsync -avz /home/student/ student@server:/backup/

# Delete files on destination that don't exist on source (mirror)
[student@local ~]$ rsync -av --delete /source/ /destination/

# Dry run - show what would happen without doing it
[student@local ~]$ rsync -av --dry-run /source/ /destination/

# Exclude files
[student@local ~]$ rsync -av --exclude='*.log' --exclude='cache/' /src/ /dst/

# Show progress for large transfers
[student@local ~]$ rsync -av --progress /large/directory/ /backup/

# Resume interrupted transfer
[student@local ~]$ rsync -av --partial /source/ /destination/
Key advantage: rsync only transfers changed portions of files. Syncing a 10GB directory where 100MB changed transfers only ~100MB.
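rsync works just as well between two local directories, which makes the delta behavior easy to experiment with before pointing it at a remote host. A minimal sketch that skips gracefully if rsync is not installed:

```shell
# Local rsync between two directories; -a preserves metadata.
set -e
if command -v rsync >/dev/null; then
  work=$(mktemp -d)
  mkdir -p "$work/src" "$work/dst"
  echo "v1" > "$work/src/file.txt"

  rsync -a "$work/src/" "$work/dst/"      # initial full copy
  echo "v2" > "$work/src/file.txt"
  rsync -a "$work/src/" "$work/dst/"      # second run moves only the change
  cat "$work/dst/file.txt"
fi
```

Note the trailing slash on src/: it means "the contents of src", not "the directory src itself". Omitting it would create dst/src/.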

Backup Best Practices

# Create timestamped backup
[student@server ~]$ tar -czvf backup-$(date +%Y%m%d).tar.gz /home/student/

# Or with full timestamp
[student@server ~]$ tar -czvf backup-$(date +%Y%m%d-%H%M%S).tar.gz /home/student/
backup-20240120-143052.tar.gz

# Verify archive integrity after creation
[student@server ~]$ tar -tzvf backup-20240120.tar.gz > /dev/null && echo "Archive OK"
Archive OK

# Create checksum for verification
[student@server ~]$ sha256sum backup-20240120.tar.gz > backup-20240120.tar.gz.sha256
[student@server ~]$ sha256sum -c backup-20240120.tar.gz.sha256
backup-20240120.tar.gz: OK

# Full backup script example
[student@server ~]$ cat backup.sh
#!/bin/bash
DATE=$(date +%Y%m%d)
BACKUP_DIR="/backup"
tar -czvf "$BACKUP_DIR/home-$DATE.tar.gz" /home/
sha256sum "$BACKUP_DIR/home-$DATE.tar.gz" > "$BACKUP_DIR/home-$DATE.tar.gz.sha256"
# Keep only last 7 days
find "$BACKUP_DIR" -name "home-*.tar.gz" -mtime +7 -delete

System Recovery Archives

# Backup critical system directories (as root)
[root@server ~]# tar -czvpf system-config.tar.gz \
    /etc \
    /var/spool/cron \
    /root \
    --exclude='/etc/mtab'

# Backup user home directories
[root@server ~]# tar -czvpf homes.tar.gz /home/

# Full system backup (excluding pseudo-filesystems)
[root@server ~]# tar -czvpf full-backup.tar.gz \
    --exclude=/proc \
    --exclude=/sys \
    --exclude=/dev \
    --exclude=/run \
    --exclude=/tmp \
    --exclude=/mnt \
    --exclude=/media \
    --exclude=/lost+found \
    --exclude='full-backup.tar.gz' \
    /

# Restore while preserving ownership (as root)
[root@server ~]# tar -xzvpf system-config.tar.gz -C /
# -p preserves permissions
# -C / extracts to root filesystem
⚠ Caution: Restoring system files can break your system. Test on non-production systems first. Have rescue media ready.

Working with Large Archives

# Split large archive into smaller pieces
[student@server ~]$ tar -czvf - /large/directory/ | split -b 1G - backup.tar.gz.part
backup.tar.gz.partaa
backup.tar.gz.partab
backup.tar.gz.partac

# Reassemble and extract
[student@server ~]$ cat backup.tar.gz.part* | tar -xzvf -

# Create archive directly to remote system (no local storage needed)
[student@server ~]$ tar -czvf - /home/student/ | ssh user@backup-server "cat > /backup/home.tar.gz"

# Extract from remote archive without downloading
[student@server ~]$ ssh user@backup-server "cat /backup/home.tar.gz" | tar -xzvf -

# Stream archive through compression to remote
[student@server ~]$ tar -cvf - /data/ | xz -T 0 | ssh user@server "cat > /backup/data.tar.xz"

# Check archive progress (with pv if available)
[student@server ~]$ tar -cvf - /large/ | pv | gzip > large.tar.gz
45.2GiB 0:12:34 [61.5MiB/s] [=========>                    ] 32% ETA 0:26:12
Streaming: Using -f - makes tar read/write to stdin/stdout, enabling powerful pipeline operations.
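The streaming pattern can be practiced without a second machine: a local pipe stands in for the ssh hop above. A minimal sketch:

```shell
# -f - streams the archive over a pipe in both directions.
set -e
work=$(mktemp -d)
mkdir -p "$work/data"
echo "streamed" > "$work/data/log.txt"
cd "$work"

# Write the archive to stdout, compress it in the pipeline
tar -cf - data/ | gzip > data.tar.gz

# Read the archive back from stdin
mkdir restore
gzip -dc data.tar.gz | tar -xf - -C restore
cat restore/data/log.txt
```

Swap the middle of either pipeline for an ssh command and you have the remote variants shown above.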

Troubleshooting Archives

# Archive is corrupted - try to extract what's possible
[student@server ~]$ tar -xvf damaged.tar --ignore-zeros
[student@server ~]$ gzip -df damaged.tar.gz     # -f overwrites existing output files

# Test archive integrity
[student@server ~]$ gzip -t backup.tar.gz
[student@server ~]$ bzip2 -t backup.tar.bz2
[student@server ~]$ xz -t backup.tar.xz

# Identify compression type of unknown file
[student@server ~]$ file mystery.archive
mystery.archive: gzip compressed data, last modified: Sat Jan 20 14:00:00 2024

# Check what's using disk space during archiving
[student@server ~]$ du -sh /home/student/* | sort -h | tail -10

# Permission denied during extraction - need root?
[student@server ~]$ tar -xvf backup.tar 2>&1 | grep -i "permission denied"

# File changed during archiving
tar: /var/log/messages: file changed as we read it
# This warning is usually OK - file was modified during backup
Common Issues: Permission denied (run as root), disk full (check space), corrupted archives (verify checksums), wrong compression flag.

Best Practices

✓ Do

  • Use timestamp in backup filenames
  • Verify archives after creation with -t
  • Create checksums for verification
  • Use -p for system backups (preserve permissions)
  • Test restoration process periodically
  • Use appropriate compression for the situation
  • List contents before extracting unknown archives
  • Use rsync for incremental/regular syncs

✗ Don't

  • Trust backups without verification
  • Extract archives as root without checking contents
  • Forget -r with zip for directories
  • Use absolute paths without understanding implications
  • Compress already-compressed files (JPEG, MP4, ZIP)
  • Keep backups only on the same disk
  • Ignore "file changed" warnings for critical data
  • Delete original until backup is verified
Remember: A backup that hasn't been tested isn't a backup - it's a hope. Regularly practice restoration!

Key Takeaways

1

tar: Create with -cvf, extract with -xvf, list with -tvf. Always use -f with filename. -p preserves permissions.

2

Compression: gzip (-z) fast, bzip2 (-j) better ratio, xz (-J) best ratio. Match to your speed/size needs.

3

Compressed archives: tar -czvf for .tar.gz, tar -cjvf for .tar.bz2, tar -cJvf for .tar.xz

4

Transfer: scp for simple copies, rsync for efficient sync. Both work over SSH securely.

Graded Lab

  • Create a tar archive of your home directory
  • Create gzip, bzip2, and xz compressed archives and compare sizes
  • List contents of an archive without extracting
  • Extract specific files from an archive
  • Transfer an archive to another system with scp
  • Use rsync to synchronize a directory

Next: Transferring Files