Regular Expressions

Pattern Matching for System Administration

RH134: Red Hat System Administration II

Efficiently complete system administration tasks by matching text patterns

Learning Objectives

  • Understand the purpose and syntax of regular expressions
  • Construct patterns using metacharacters
  • Use character classes and quantifiers
  • Apply anchors and grouping
  • Differentiate between BRE and ERE
  • Use regex with grep, sed, and awk

What are Regular Expressions?

Regular Expression (Regex) = A sequence of characters that defines a search pattern for matching text

Use Cases

  • Searching log files
  • Validating input formats
  • Extracting data
  • Find and replace operations
  • Filtering output

Tools That Use Regex

  • grep / egrep
  • sed
  • awk
  • vim
  • less / more

Literal vs. Metacharacters

Literal Characters

Match themselves exactly

Pattern: cat
Matches: "cat", "scatter"
No match: "Car", "CAT"

Metacharacters

Have special meaning

Pattern: c.t
Matches: "cat", "cot", "cut"
No match: "ct", "caat"

Metacharacters: . ^ $ * + ? { } [ ] \ | ( )

BRE vs ERE

Basic Regular Expressions

Default for grep, sed

  • + ? { } | ( ) are literal
  • Use \+ \? \{\} \| \(\) for special meaning (\+, \?, and \| are GNU extensions, not POSIX)
  • More escaping required
# BRE - escape for special meaning
grep 'ab\+c' file
grep '\(ab\)\+' file

Extended Regular Expressions

grep -E, egrep, awk

  • + ? { } | ( ) are special
  • Escape to match literally
  • Cleaner, more intuitive syntax
# ERE - cleaner syntax
grep -E 'ab+c' file
grep -E '(ab)+' file

💡 Recommendation: Use ERE (grep -E) for cleaner, more readable patterns
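
The difference is easiest to see by running one pattern in both modes. The following is a minimal sketch on made-up sample lines, assuming GNU grep: in BRE the + is an ordinary character, while in ERE it is a quantifier.

```shell
# Hypothetical sample input: three lines, one containing a literal "+"
sample='abc
abbc
ab+c'

# BRE: "+" is literal, so only the line with the text "ab+c" matches
bre_matches=$(printf '%s\n' "$sample" | grep 'ab+c')

# ERE: "+" means "one or more b", so "abc" and "abbc" match
ere_matches=$(printf '%s\n' "$sample" | grep -E 'ab+c')

printf 'BRE matches:\n%s\n' "$bre_matches"
printf 'ERE matches:\n%s\n' "$ere_matches"
```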

The Dot Metacharacter

. matches any single character (except newline)

Pattern: h.t
✓ "hat", "hot", "hit", "h9t", "h t"
✗ "ht", "hoot", "HEAT"
# Find three-letter words starting with 'c' and ending with 't'
grep '^c.t$' /usr/share/dict/words
# cat, cot, cut, etc.

# Match any three characters between slashes
grep '/.../' /etc/passwd
# Matches "/bin/" in /bin/bash

# Be careful - an unescaped dot matches ANY character!
echo "192.168.1.1" | grep '192.168.1.1'    # Matches, but the dots are not treated as literal
echo "192x168y1z1" | grep '192.168.1.1'    # Also matches!
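
The effect of the unescaped dot can be counted directly. This sketch uses two made-up input lines and compares the pattern with and without backslash-escaped dots; the variable names are illustrative only.

```shell
# Hypothetical sample: one real IP address and one look-alike string
sample='192.168.1.1
192x168y1z1'

# Unescaped dots match any character, so both lines match
unescaped=$(printf '%s\n' "$sample" | grep -c '192.168.1.1')

# Escaped dots match only a literal ".", so one line matches
escaped=$(printf '%s\n' "$sample" | grep -c '192\.168\.1\.1')

echo "unescaped dots: $unescaped matching lines"
echo "escaped dots:   $escaped matching lines"
```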

Anchors: ^ and $

^ matches start of line

Pattern: ^root
✓ "root:x:0:0..."
✗ "the root user"

$ matches end of line

Pattern: bash$
✓ "/bin/bash"
✗ "bash script"
# Find users with bash shell
grep 'bash$' /etc/passwd

# Find comment lines (starting with #)
grep '^#' /etc/ssh/sshd_config

# Find empty lines
grep '^$' /etc/ssh/sshd_config

# Find lines with ONLY "root"
grep '^root$' /etc/group
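
To try the anchors without touching system files, the sketch below runs them against a small made-up passwd-style sample (the user names and the comment line are hypothetical) and counts the matching lines.

```shell
# Hypothetical sample: passwd-style lines, one comment, one empty line
sample='root:x:0:0:root:/root:/bin/bash
# the root user
daemon:x:2:2:daemon:/sbin:/sbin/nologin

alice:x:1000:1000::/home/alice:/bin/bash'

# ^ anchors to the start: the comment mentioning "root" is not counted
starts_with_root=$(printf '%s\n' "$sample" | grep -c '^root')

# $ anchors to the end: only lines whose last field ends in "bash"
ends_with_bash=$(printf '%s\n' "$sample" | grep -c 'bash$')

# ^$ leaves no room for any character: empty lines only
empty_lines=$(printf '%s\n' "$sample" | grep -c '^$')

echo "start with root: $starts_with_root"
echo "end with bash:   $ends_with_bash"
echo "empty lines:     $empty_lines"
```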

Escaping Metacharacters

\ removes special meaning from the next character

# Match a literal dot (IP address)
grep '192\.168\.1\.1' /etc/hosts

# Match a literal asterisk
grep '\*\*\*' logfile

# Match a dollar sign
grep '\$HOME' script.sh

# Match a caret
grep '\^' file

# Match a backslash itself
grep '\\' /etc/fstab

⚠️ Shell Quoting: Use single quotes to prevent shell expansion!

grep "$HOME" file    # Shell expands $HOME first; regex sees e.g. /root
grep '\$HOME' file   # Regex sees: \$HOME (matches the literal text $HOME)
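
The interaction between shell quoting and regex escaping can be checked with a throwaway variable. In this sketch NAME is a hypothetical variable and the two input lines are made up; it shows what grep actually receives in each quoting style.

```shell
# NAME is a hypothetical variable used only for this demonstration
NAME=alice
sample='user=$NAME
user=alice'

# Double quotes: the shell substitutes $NAME before grep runs,
# so grep searches for "alice"
expanded=$(printf '%s\n' "$sample" | grep "$NAME")

# Single quotes plus backslash: grep receives \$NAME and matches
# the literal text $NAME
literal=$(printf '%s\n' "$sample" | grep '\$NAME')

printf 'double quotes found: %s\n' "$expanded"
printf 'single quotes found: %s\n' "$literal"
```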

Character Classes: [ ]

[abc] matches any ONE character from the set

Pattern: [aeiou]
✓ "hello" - matches 'e'
✓ "world" - matches 'o'
✗ "myth" - no vowels
# Match lines containing a vowel
grep '[aeiou]' /etc/passwd

# Match any digit
grep '[0123456789]' file

# Match either case of the first letter
grep '[Rr]oot' /etc/passwd

# Match file names containing "log" followed by a digit
ls /var/log | grep 'log[0-9]'

Character Ranges

[a-z] matches any character in the range

Range          Matches             Example
[a-z]          Lowercase letters   a, b, c, ... z
[A-Z]          Uppercase letters   A, B, C, ... Z
[0-9]          Digits              0, 1, 2, ... 9
[a-zA-Z]       All letters         Any letter
[a-zA-Z0-9]    Alphanumeric        Letters and digits
[0-9a-fA-F]    Hexadecimal         0-9, a-f, A-F
# Find lines starting with uppercase letter
grep '^[A-Z]' /etc/services

Negated Character Classes

[^abc] matches any character NOT in the set

Pattern: [^0-9]
✓ "abc" - 'a' is not a digit
✓ "1x2" - 'x' is not a digit
✗ "123" - all are digits
# Find lines NOT starting with # (also skips empty lines, since [^#] needs one character)
grep '^[^#]' /etc/ssh/sshd_config

# Find lines containing non-alphanumeric characters
grep '[^a-zA-Z0-9]' passwords.txt

# Find non-printable characters
grep '[^[:print:]]' file

⚠️ Note: ^ means negation only when it's the first character inside [ ]

POSIX Character Classes

Class          Equivalent       Matches
[[:alpha:]]    [a-zA-Z]         Alphabetic characters
[[:digit:]]    [0-9]            Digits
[[:alnum:]]    [a-zA-Z0-9]      Alphanumeric
[[:space:]]    [ \t\n\r\f\v]    Whitespace
[[:lower:]]    [a-z]            Lowercase
[[:upper:]]    [A-Z]            Uppercase
[[:punct:]]    -                Punctuation
[[:print:]]    -                Printable characters
# Find lines with digits (locale-safe)
grep '[[:digit:]]' /var/log/messages
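
The classes can be combined with negation and anchors like any other bracket expression. The sketch below runs a few of them against four made-up lines; the sample text and variable names are illustrative only.

```shell
# Hypothetical sample lines
sample='abc123
hello world
under_score
42'

# Lines containing at least one digit
with_digit=$(printf '%s\n' "$sample" | grep -c '[[:digit:]]')

# Lines containing whitespace
with_space=$(printf '%s\n' "$sample" | grep -c '[[:space:]]')

# Lines made up of digits only (anchored at both ends, ERE for +)
only_digits=$(printf '%s\n' "$sample" | grep -E '^[[:digit:]]+$')

# Lines containing a character that is neither a letter nor a digit
non_alnum=$(printf '%s\n' "$sample" | grep -c '[^[:alnum:]]')

echo "digit: $with_digit, space: $with_space, non-alnum: $non_alnum"
echo "digits only: $only_digits"
```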

Quantifiers: How Many?

Specify how many times the preceding element should match

*        0 or more
+        1 or more
?        0 or 1
{n}      exactly n
{n,m}    n to m

The Asterisk: Zero or More

* matches the preceding element zero or more times

Pattern: ab*c
✓ "ac" - zero b's
✓ "abc" - one b
✓ "abbbc" - three b's
✗ "adc" - wrong character
# Match "color" or "colour" (u* also allows "colouur")
grep 'colou*r' file

# Match "error:" followed by one or more spaces
grep 'error:  *' logfile    # Space followed by zero or more spaces

# Match anything (greedy!)
grep '.*' file              # Matches every line, including empty ones

# BRE idiom for "one or more": write the character once, then add *
grep 'ss*' /etc/passwd      # One or more 's'

Plus and Question Mark (ERE)

+ = one or more

Pattern: ab+c
✗ "ac" - needs at least one b
✓ "abc"
✓ "abbbc"

? = zero or one

Pattern: colou?r
✓ "color"
✓ "colour"
✗ "colouur"
# ERE: Match one or more digits (must use -E)
grep -E '[0-9]+' /var/log/messages

# ERE: Optional 's' for plural
grep -E 'files?' file

# BRE equivalent (escaped)
grep '[0-9]\+' /var/log/messages
grep 'files\?' file

Interval Quantifiers: { }

{n,m} matches between n and m times (inclusive)

Syntax    Meaning            Example
{3}       Exactly 3 times    [0-9]{3} = "123"
{2,4}     2 to 4 times       a{2,4} = "aa", "aaa", "aaaa"
{2,}      2 or more times    x{2,} = "xx", "xxx", ...
{0,3}     0 to 3 times       y{0,3} = "", "y", "yy", "yyy"
# Match US ZIP codes (5 digits)
grep -E '^[0-9]{5}$' zipcodes.txt

# Match ZIP+4 format (5 digits, hyphen, 4 digits)
grep -E '^[0-9]{5}-[0-9]{4}$' zipcodes.txt

# Match 2-4 letter words
grep -E '\b[a-zA-Z]{2,4}\b' document.txt
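
The two ZIP patterns can be folded into one by making the +4 part an optional group. This is a sketch on made-up input; zip_re is an illustrative variable name, and the anchors are what reject the too-short and too-long lines.

```shell
# Hypothetical sample: valid and invalid ZIP code candidates
sample='12345
1234
123456
12345-6789
12345-678'

# Five digits, optionally followed by a hyphen and exactly four digits
zip_re='^[0-9]{5}(-[0-9]{4})?$'

valid=$(printf '%s\n' "$sample" | grep -E "$zip_re")
printf 'valid ZIP codes:\n%s\n' "$valid"
```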

Greedy vs. Lazy Matching

⚠️ Quantifiers are greedy by default - they match as much as possible

Text: <b>bold</b> and <b>more</b>
Pattern: <b>.*</b>
Greedy match: "<b>bold</b> and <b>more</b>"
Better pattern: <b>[^<]*</b>
Matches: "<b>bold</b>" then "<b>more</b>"
# Problem: greedy matching
echo '<b>first</b> <b>second</b>' | grep -o '<b>.*</b>'
# Returns: <b>first</b> <b>second</b>

# Solution: negated character class
echo '<b>first</b> <b>second</b>' | grep -oE '<b>[^<]*</b>'
# Returns: <b>first</b>
#          <b>second</b>

Alternation: The OR Operator

| matches either the expression before OR after

Pattern: cat|dog
✓ "I have a cat"
✓ "I have a dog"
✗ "I have a bird"
# Match error or warning
grep -E 'error|warning' /var/log/messages

# Match multiple file extensions (anchored to the end of the name)
ls | grep -E '\.jpg$|\.png$|\.gif$'

# Match different log levels
grep -E 'ERROR|WARN|FATAL' application.log

# BRE requires escape
grep 'error\|warning' /var/log/messages

Grouping with Parentheses

( ) groups expressions for quantifiers and alternation

Without Grouping

Pattern: ab+
Matches: a followed by one+ b's
✓ "ab", "abb", "abbb"

With Grouping

Pattern: (ab)+
Matches: "ab" one or more times
✓ "ab", "abab", "ababab"
# Repeat a group
grep -E '(na)+' lyrics.txt          # "na", "nana", "nanana"

# Optional group
grep -E 'http(s)?://' urls.txt      # http:// or https://

# Group with alternation
grep -E '(Mon|Tues|Wednes|Thurs|Fri)day' calendar.txt
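
Grouping also controls how far an alternation reaches. Alternation has the lowest precedence, so an anchor written next to one branch applies to that branch only. The sketch below, on made-up log lines, compares the ungrouped and grouped forms.

```shell
# Hypothetical sample log lines
sample='error: disk full
warning: low memory
info: no warning: all clear'

# Ungrouped: "^error" OR "warning:" anywhere on the line
ungrouped=$(printf '%s\n' "$sample" | grep -cE '^error|warning:')

# Grouped: the ^ anchor and the ":" apply to both alternatives
grouped=$(printf '%s\n' "$sample" | grep -cE '^(error|warning):')

echo "ungrouped: $ungrouped lines, grouped: $grouped lines"
```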

Backreferences

\1, \2 reference previously matched groups

# Find repeated words
grep -E '\b([a-z]+)\s+\1\b' document.txt
# Matches: "the the", "is is", etc.

# Find lines where first and last word are the same
grep -E '^([a-zA-Z]+)\b.*\b\1$' file

# Match HTML tags with matching close tags
grep -E '<([a-z]+)>.*</\1>' file.html

# Find duplicate lines (a backreference cannot span lines, so use uniq)
sort file | uniq -d

💡 Use Case: Finding duplicate words, validating paired elements, data consistency checks
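
The repeated-word check can be tried on inline text. This sketch assumes GNU grep, where \b and backreferences in ERE are available as extensions; the sample sentences are made up, and a plain space is used in place of \s to keep the pattern simple.

```shell
# Hypothetical sample sentences, two of them with a doubled word
sample='this is is a test
all good here
it was the the best'

# ([a-z]+) captures a word; \1 requires the same text again
repeated=$(printf '%s\n' "$sample" | grep -oE '\b([a-z]+) +\1\b')

printf 'repeated words:\n%s\n' "$repeated"
```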

grep: Pattern Searching

The primary tool for regex searching in Linux

grep               Basic Regular Expressions
grep -E / egrep    Extended Regular Expressions
grep -F / fgrep    Fixed strings (no regex)

Essential grep Options

Option    Description             Example
-i        Case insensitive        grep -i 'error'
-v        Invert match            grep -v '^#'
-c        Count matching lines    grep -c 'pattern'
-n        Show line numbers       grep -n 'TODO'
-l        List filenames only     grep -l 'main' *.c
-o        Only matching part      grep -oE '[0-9]+'
-r        Recursive search        grep -r 'config' /etc
-w        Whole word match        grep -w 'is'

Context Options

# Show 3 lines BEFORE match
grep -B3 'error' /var/log/messages

# Show 3 lines AFTER match
grep -A3 'error' /var/log/messages

# Show 3 lines before AND after (context)
grep -C3 'error' /var/log/messages

# Combine with other options
grep -B2 -A2 -n 'Exception' application.log

# Example: an 'error' match shown with two lines of context before and after
May 15 10:23:45 server process[1234]: Starting operation
May 15 10:23:46 server process[1234]: Loading config
May 15 10:23:47 server process[1234]: error: config not found
May 15 10:23:48 server process[1234]: Falling back to defaults
May 15 10:23:49 server process[1234]: Continuing...

grep Practical Examples

# Find failed SSH logins
grep -E 'Failed password|authentication failure' /var/log/secure

# Extract IP addresses from log
grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' access.log

# Find active config lines (not comments, not empty)
grep -v '^#' /etc/ssh/sshd_config | grep -v '^$'

# Better: combine with ERE
grep -vE '^#|^$' /etc/ssh/sshd_config

# Count error types
grep -oE 'error [0-9]+' log | sort | uniq -c | sort -rn

# Find files containing pattern
grep -rl 'TODO' --include='*.py' ./src/

# Multiple patterns from file
grep -f patterns.txt logfile

sed: Stream Editor

sed applies text transformations using regular expressions

# Basic syntax
sed 's/pattern/replacement/' file
sed 's/pattern/replacement/g' file    # Global (all occurrences)

# In-place editing
sed -i 's/old/new/g' file             # Modifies file directly
sed -i.bak 's/old/new/g' file         # Creates backup first

⚠️ Critical: sed -i modifies files directly! Always test first or create backups.
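
A safe way to practice is to work on a temporary file. The sketch below first compares substitution with and without the g flag, then runs an in-place edit with a backup suffix; the file is created with mktemp and all names are illustrative.

```shell
line='old old old'

# Without g: only the first occurrence on the line is replaced
first_only=$(printf '%s\n' "$line" | sed 's/old/new/')

# With g: every occurrence on the line is replaced
all=$(printf '%s\n' "$line" | sed 's/old/new/g')

# In-place edit on a throwaway file, keeping the original as .bak
tmpfile=$(mktemp)
printf '%s\n' "$line" > "$tmpfile"
sed -i.bak 's/old/new/g' "$tmpfile"
edited=$(cat "$tmpfile")
backup=$(cat "$tmpfile.bak")
rm -f "$tmpfile" "$tmpfile.bak"

echo "first only: $first_only"
echo "global:     $all"
echo "edited:     $edited (backup kept: $backup)"
```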

sed Substitution Patterns

# Basic substitution
sed 's/error/ERROR/' logfile

# Global substitution (all occurrences on line)
sed 's/old/new/g' file

# Case insensitive
sed 's/error/ERROR/gi' file

# Delete matching lines
sed '/pattern/d' file

# Delete empty lines
sed '/^$/d' file

# Delete comments
sed '/^#/d' /etc/config

# Multiple operations
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file

# Using different delimiter (useful for paths)
sed 's|/usr/local|/opt|g' file

sed with Capture Groups

# Swap first two fields (colon-separated)
sed 's/\([^:]*\):\([^:]*\)/\2:\1/' /etc/passwd

# ERE syntax (cleaner)
sed -E 's/([^:]*):([^:]*)/\2:\1/' /etc/passwd

# Reformat date: MM/DD/YYYY to YYYY-MM-DD
sed -E 's|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3-\1-\2|g' dates.txt

# Add prefix to captured content
sed -E 's/^([0-9]+)/ID: \1/' file

# Surround matches with tags
sed -E 's/([0-9]{3}-[0-9]{4})/PHONE:\1:PHONE/g' contacts.txt

# Remove duplicate words
sed -E 's/\b([a-z]+)\s+\1\b/\1/g' document.txt
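
Two of the substitutions above can be checked on inline input. In this sketch the sample text is made up; the numbered groups are reordered in the replacement exactly as in the date and field-swap examples.

```shell
# Hypothetical sample text containing two US-style dates
dates='Released 12/25/2024, patched 01/07/2025'

# \1 = month, \2 = day, \3 = year; the replacement reorders them
iso=$(printf '%s\n' "$dates" |
    sed -E 's|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3-\1-\2|g')

# Swap the first two colon-separated fields of a passwd-style line
swapped=$(printf 'root:x:0:0\n' | sed -E 's/([^:]*):([^:]*)/\2:\1/')

echo "$iso"
echo "$swapped"
```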

sed Address Ranges

# Apply only to line 5
sed '5s/old/new/' file

# Apply to lines 5-10
sed '5,10s/old/new/' file

# Apply from line 5 to end
sed '5,$s/old/new/' file

# Apply to lines matching pattern
sed '/^#/s/old/new/' file

# Apply between two patterns
sed '/START/,/END/s/old/new/' file

# Delete from pattern to end of file
sed '/pattern/,$d' file

# Print only lines 10-20
sed -n '10,20p' file

awk: Pattern Processing

awk combines regex pattern matching with field processing

# Basic syntax
awk '/pattern/ { action }' file

# Print lines matching pattern
awk '/error/' /var/log/messages

# Print specific fields from matching lines
awk '/error/ { print $1, $5 }' /var/log/messages

# Field separator
awk -F: '/root/ { print $1, $7 }' /etc/passwd

awk Pattern Matching

# Match at beginning of line
awk '/^root/' /etc/passwd

# Match at end of line
awk '/bash$/' /etc/passwd

# Match specific field
awk -F: '$7 ~ /bash/' /etc/passwd    # Field 7 contains "bash"
awk -F: '$7 == "/bin/bash"' /etc/passwd  # Field 7 equals exactly

# Negation
awk -F: '$7 !~ /nologin/' /etc/passwd    # Field 7 doesn't contain

# Complex conditions
awk -F: '$3 >= 1000 && $7 ~ /bash/' /etc/passwd

# Range pattern
awk '/start/,/end/' file    # From a /start/ line through the next /end/ line
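
Field tests and regex tests combine naturally. The sketch below applies them to a made-up passwd-style sample (the accounts are hypothetical), selecting regular users with a bash shell and counting accounts that cannot log in.

```shell
# Hypothetical passwd-style sample
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
alice:x:1000:1000::/home/alice:/bin/bash
svc:x:1001:1001::/home/svc:/sbin/nologin'

# Numeric test on field 3 combined with a regex test on field 7
regular_bash_users=$(printf '%s\n' "$sample" |
    awk -F: '$3 >= 1000 && $7 ~ /bash$/ { print $1 }')

# Count accounts whose shell ends in "nologin"; n + 0 prints 0 if none match
no_login=$(printf '%s\n' "$sample" |
    awk -F: '$7 ~ /nologin$/ { n++ } END { print n + 0 }')

echo "regular bash users: $regular_bash_users"
echo "nologin accounts:   $no_login"
```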

awk Practical Examples

# Sum values in a column
awk '{ sum += $1 } END { print sum }' numbers.txt

# Average of matching lines (skip the division when nothing matched)
awk '/error/ { count++; sum += $NF } END { if (count) print sum/count }' log

# Extract unique values
awk -F: '{ print $7 }' /etc/passwd | sort -u

# Format output
awk -F: '{ printf "%-15s %s\n", $1, $7 }' /etc/passwd

# Count pattern occurrences by category
awk '/error/ { errors++ } /warning/ { warnings++ } 
     END { print "Errors:", errors, "Warnings:", warnings }' log

# Process Apache logs - count requests per IP
awk '{ ips[$1]++ } END { for (ip in ips) print ip, ips[ip] }' access.log

Common Patterns Library

Frequently used regex patterns for system administration

IP Addresses | Email | Dates/Times | URLs | Log Entries

IP Address Pattern

[0-9]{1,3} 1-3 digits
\. literal dot
[0-9]{1,3} 1-3 digits
\. literal dot
[0-9]{1,3} 1-3 digits
\. literal dot
[0-9]{1,3} 1-3 digits
# Simple IP pattern (matches invalid IPs like 999.999.999.999)
grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' log

# Extract IPs from Apache log
awk '{ print $1 }' access.log | grep -oE '[0-9.]+' | sort -u

# Count connections per IP
grep -oE '^[0-9.]+' access.log | sort | uniq -c | sort -rn | head

# Find specific subnet
grep -E '192\.168\.[0-9]+\.[0-9]+' /var/log/messages
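
If out-of-range values such as 999.999.999.999 must be rejected, each octet can be limited to 0-255 with alternation. This is one possible sketch, not the only way to do it; octet and ip_re are illustrative variable names, and the pattern is anchored so that it validates whole lines rather than extracting addresses.

```shell
# One octet: 250-255, 200-249, 100-199, or 0-99
octet='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'

# Four octets separated by literal dots, anchored to the whole line
ip_re="^${octet}(\.${octet}){3}\$"

# Hypothetical sample: two valid and two invalid addresses
sample='192.168.1.1
999.999.999.999
256.1.1.1
10.0.0.255'

valid=$(printf '%s\n' "$sample" | grep -E "$ip_re")
printf 'valid addresses:\n%s\n' "$valid"
```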

Email and URL Patterns

# Basic email pattern
grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' contacts.txt

# URL pattern
grep -oE 'https?://[^[:space:]]+' document.txt

# Domain extraction from URL
grep -oE 'https?://[^/]+' urls.txt | sed 's|https\?://||'

# Find mailto links in HTML
grep -oE 'mailto:[^"]+' page.html

# Validate URL format
if [[ "$URL" =~ ^https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} ]]; then
    echo "Valid URL"
fi

💡 Note: RFC-compliant email/URL validation is complex. These patterns work for common cases.

Date and Time Patterns

# ISO date: YYYY-MM-DD
grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' file

# US date: MM/DD/YYYY
grep -E '[0-9]{2}/[0-9]{2}/[0-9]{4}' file

# Syslog timestamp: Mon DD HH:MM:SS
grep -E '^[A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}' /var/log/messages

# 24-hour time: HH:MM:SS (hours limited to 00-23)
grep -oE '([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]' logfile

# Extract today's log entries
grep "^$(date '+%b %e')" /var/log/messages

# Find entries in time range (compares the syslog time field as a string)
awk '$3 >= "10:00:00" && $3 <= "11:00:00"' /var/log/messages

Log Analysis Patterns

# Find HTTP error codes (4xx, 5xx)
grep -E '" [45][0-9]{2} ' access.log

# Extract error messages
grep -oE 'error: [^,]+' application.log

# Find slow queries (1000 ms or more: a value of four or more digits)
grep -E 'query_time=[0-9]{4,}' mysql.log

# Match stack traces
grep -A20 'Exception' java.log

# Find repeated failed logins
grep 'Failed password' /var/log/secure | 
    grep -oE 'from [0-9.]+' | 
    sort | uniq -c | sort -rn

# Parse key=value pairs
grep -oE 'user=[^[:space:]]+' audit.log

Building Patterns Step by Step

Methodology for complex regex construction

  1. Identify what you need to match
  2. Start simple - match literal examples first
  3. Generalize - replace literals with character classes
  4. Add quantifiers - handle variable lengths
  5. Anchor - constrain position if needed
  6. Test - verify with sample data

Pattern Building Example

Goal: Match Apache log entries with 404 errors

# Step 1: Match literal example
grep '404' access.log

# Step 2: Match the error code in context (status field)
grep '" 404 ' access.log

# Step 3: Add flexibility for any 4xx error  
grep -E '" 4[0-9]{2} ' access.log

# Step 4: Extract relevant fields
grep -E '" 4[0-9]{2} ' access.log | awk '{ print $1, $7, $9 }'

# Step 5: Further refinement - get requested URLs
grep -E '" 4[0-9]{2} ' access.log | 
    awk '{ print $7 }' | 
    sort | uniq -c | sort -rn
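
The refinement steps can be replayed on a few made-up entries in the same format as a typical access log. In this sketch the log lines are hypothetical; note how the naive pattern also counts a URL that merely contains "404". The final awk only normalizes the padding that uniq -c adds.

```shell
# Hypothetical access log entries
log='192.168.1.100 - - [10/May/2024:10:15:30 +0000] "GET /page.html HTTP/1.1" 200 1234
192.168.1.101 - - [10/May/2024:10:15:31 +0000] "GET /missing.html HTTP/1.1" 404 512
192.168.1.102 - - [10/May/2024:10:15:32 +0000] "GET /missing.html HTTP/1.1" 404 512
192.168.1.101 - - [10/May/2024:10:15:33 +0000] "GET /admin HTTP/1.1" 403 128
192.168.1.100 - - [10/May/2024:10:15:34 +0000] "GET /404.html HTTP/1.1" 200 2048'

# Step 1: literal match, which also hits the URL /404.html
naive=$(printf '%s\n' "$log" | grep -c '404')

# Step 2: match the status field in context
status_404=$(printf '%s\n' "$log" | grep -c '" 404 ')

# Steps 3-5: any 4xx status, then count requested URLs (field 7)
top_urls=$(printf '%s\n' "$log" | grep -E '" 4[0-9]{2} ' |
    awk '{ print $7 }' | sort | uniq -c | sort -rn |
    awk '{ print $1, $2 }')

echo "naive: $naive, status 404: $status_404"
printf '4xx URLs:\n%s\n' "$top_urls"
```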

Common Mistakes & Solutions

Mistake                    Example             Solution
Unescaped dots             192.168.1.1         192\.168\.1\.1
Using shell glob syntax    grep *.txt file     grep '\.txt' file
BRE vs ERE confusion       grep 'a+' file      grep -E 'a+' file
Missing quotes             grep $var file      grep "$var" file
Greedy matching            <.*>                <[^>]*>
Case sensitivity           grep 'Error' log    grep -i 'error' log

Testing and Debugging Tools

# Test pattern interactively with color highlighting
grep --color=always 'pattern' file | less -R

# Show what's matching with -o
echo "test string here" | grep -oE 'pattern'

# Debug by building up pattern
grep 'simple' file      # Start here
grep 'simp.e' file      # Add complexity
grep -E 'simp.e+' file  # Add more

# Count matches vs lines
grep -c 'pattern' file      # Lines containing match
grep -o 'pattern' file | wc -l  # Total matches

# Perl-compatible regex for testing (if available)
grep -P '(?<=prefix)pattern(?=suffix)' file
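
The difference between counting lines and counting matches shows up as soon as one line contains the pattern twice. A small sketch on made-up input; tr strips the padding that some wc implementations add.

```shell
# Hypothetical sample: the first line contains "error" twice
sample='error one, error two
all fine
error three'

# -c counts lines that contain at least one match
matching_lines=$(printf '%s\n' "$sample" | grep -c 'error')

# -o prints each match on its own line, so wc -l counts all matches
total_matches=$(printf '%s\n' "$sample" | grep -o 'error' | wc -l | tr -d ' ')

echo "lines: $matching_lines, matches: $total_matches"
```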

💡 Online Tools: regex101.com, regexr.com for interactive testing

Lab Exercise

Analyze a web server access log:

  1. Find all requests resulting in 404 errors
  2. Extract unique IP addresses that generated errors
  3. Count requests per IP, find top 10 clients
  4. Find requests for PHP files (potential attacks)
  5. Extract all requested URLs containing "admin"
  6. Find requests with unusually long query strings (>100 chars)
# Sample log format:
# 192.168.1.100 - - [10/May/2024:10:15:30 +0000] "GET /page.html HTTP/1.1" 200 1234
# Start with: /var/log/httpd/access_log or generate test data

Key Takeaways

  • Metacharacters provide special matching capabilities
  • Character classes match sets of characters
  • Quantifiers control repetition (*, +, ?, {})
  • Anchors (^, $) match positions
  • Use ERE (grep -E) for cleaner syntax
  • Combine grep, sed, and awk for powerful text processing
  • Build patterns incrementally and test often

Practice: Use regex daily - every log file is an opportunity!

Quick Reference Card

Metacharacters

.      Any character
^      Start of line
$      End of line
\      Escape
[]     Character class
[^]    Negated class

Quantifiers (ERE)

*        Zero or more
+        One or more
?        Zero or one
{n}      Exactly n
{n,m}    n to m times

POSIX Classes

[[:alpha:]]    Letters
[[:digit:]]    Digits
[[:alnum:]]    Alphanumeric
[[:space:]]    Whitespace

grep Options

-E    Extended regex
-i    Case insensitive
-v    Invert match
-o    Only matching
-c    Count

Additional Resources

  • man 7 regex - POSIX regex specification
  • man grep / sed / awk - Tool-specific regex details
  • info sed - Comprehensive sed documentation
  • regex101.com - Interactive regex tester
  • regexr.com - Visual regex builder
  • Regular-Expressions.info - Comprehensive tutorials

man grep | man sed | man awk

Questions?

grep -E '.*' your_questions.txt
