Blue Team Operations Guide
Monitoring Setup
Install Splunk Enterprise and forwarders on every Linux and Windows host before go-live.
Open Monitoring Guide
Detection Queries
Ready-to-paste SPL alerts and dashboards for Linux, Windows, and server services.
Open Dashboards & Alerts
Overview & Priorities
Your role is detection and response · not pentesting, not patching everything in sight. The red team will get in. Your job is to know when, where, and how, fast enough to contain it.
Priority order during prep
1. Asset inventory · you can't defend what you don't know exists.
2. Credential rotation · assume the red team has every default.
3. Baseline snapshots · you need a "known good" for diffing later.
4. Splunk ingestion · get logs flowing ASAP; you're blind without them.
5. Hardening · lock down services, disable junk, firewall rules.
6. Dashboards & alerts · build the cockpit you'll watch live.
7. Backups · snapshot everything before go-live.
Team roles (3-person)
Watcher
Eyes on Splunk dashboards, triages alerts, logs every event in the running journal.
Responder
Investigates the Watcher's leads · runs commands on hosts, pulls files, decides containment.
Reserve / Sleep
Off-shift, recovering. Rotates in for the next slot. Stay disciplined · don't burn out hour 2.
Shift Schedule
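Three people · 2.5 days · 8-hour slots · always 2 active, 1 on reserve/sleep. Once the rotation is running, each person works two consecutive slots (16 hours on) and then rests for one (8 hours off).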
| Slot | 00–08 | 08–16 | 16–24 |
|---|---|---|---|
| Day 1 | P3 sleep | P1 + P2 | P2 + P3 |
| Day 2 | P3 + P1 | P1 + P2 | P2 + P3 |
| Day 3 | P3 + P1 | P1 + P2 | · |
Golden Rules
- Log every action you take · timestamp, host, command, outcome.
- Never act on a host alone · buddy-check destructive commands before running.
- Write down findings immediately · if it's not in the journal, it didn't happen.
- Don't reboot servers without explicit team agreement · you may destroy live evidence.
- Trust nothing, verify twice · false positives come fast and waste cycles.
- If you panic, stand up. Walk 60 seconds. Then act.
- The red team wants you tunnel-visioned · keep the dashboard rotation discipline.
Asset Inventory
Build a master list of every host, IP, role, OS, and service before anything else. Without it you are blind.
Network discovery
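    # Discover live hosts on the local /24 (example range; substitute your own)
    sudo nmap -sn 192.168.1.0/24 -oA hosts_alive
    # -iL expects a plain target list, so extract the live IPs from the grepable output first
    awk '/Status: Up/ {print $2}' hosts_alive.gnmap > live_hosts.txt
    # Service/version scan on discovered hosts
    sudo nmap -sV -sC -O -iL live_hosts.txt -oA services
    # Quick top-1000 sweep
    sudo nmap -T4 --top-ports 1000 192.168.1.0/24 -oA top1000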
Inventory template
| IP | Hostname | OS | Role | Owner |
|---|---|---|---|---|
| 10.0.0.10 | dc01 | Win 2022 | Domain Controller | P1 |
| 10.0.0.20 | web01 | Ubuntu 22 | Apache + PHP | P2 |
| 10.0.0.30 | db01 | Debian 12 | MySQL | P2 |
| 10.0.0.40 | splunk | Ubuntu 22 | SIEM | P3 |
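The ping sweep's grepable output can seed this table. The function below is a minimal sketch, assuming the `hosts_alive.gnmap` file produced by `nmap -sn ... -oA hosts_alive`; the name `gnmap_to_inventory` and the `TODO` placeholders are illustrative, since a ping sweep cannot tell you OS, role, or owner.

```shell
# gnmap_to_inventory FILE
# Reads nmap grepable output and prints one table row per host that is Up.
# OS, Role and Owner are left as TODO for a human to fill in.
gnmap_to_inventory() {
    awk '
        /^Host: / && /Status: Up/ {
            ip = $2
            name = $3
            gsub(/[()]/, "", name)          # "(dc01.lab)" -> "dc01.lab", "()" -> ""
            if (name == "") name = "unknown"
            printf "| %s | %s | TODO | TODO | TODO |\n", ip, name
        }
    ' "$1"
}
```

Usage: `gnmap_to_inventory hosts_alive.gnmap >> inventory.md`, then fill in the remaining columns by hand or from the service scan.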
Baseline Snapshots
You need a "known good" of every host. Hash everything you can. When something feels off later, you diff against this.
    # Linux
    # Hash all binaries in PATH
    find /usr/bin /usr/sbin /bin /sbin -type f -exec sha256sum {} + > /root/baseline-bins.txt
    # Capture state of services, listeners, users, cron
    systemctl list-units --type=service --state=running > /root/baseline-services.txt
    ss -tlnp > /root/baseline-listeners.txt
    cat /etc/passwd /etc/shadow /etc/group > /root/baseline-accounts.txt   # contains password hashes: keep root-only
    { crontab -l; ls -la /etc/cron*; } > /root/baseline-cron.txt 2>&1      # braces so both commands are captured
    # SUID/SGID list, needed for the IR diff later
    find / -perm /6000 -type f 2>/dev/null > /root/baseline-suid.txt
    # Pull the baseline off-host immediately (the target directory must already exist)
    scp /root/baseline-*.txt blue@splunk:/var/baselines/$(hostname)/
    # Windows (PowerShell)
    Get-Service | Where Status -eq Running | Export-Csv baseline-services.csv
    Get-Process | Export-Csv baseline-procs.csv
    Get-NetTCPConnection -State Listen | Export-Csv baseline-listeners.csv
    # Export users and admin-group members separately so both are written to disk
    Get-LocalUser | Export-Csv baseline-users.csv
    Get-LocalGroupMember Administrators | Export-Csv baseline-admins.csv
    Get-ScheduledTask | Export-Csv baseline-tasks.csv
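Diffing against the baseline is easier with a classified summary than with raw `diff` output. A minimal sketch, assuming both files are `sha256sum` output and that paths contain no whitespace; `compare_hashes` is a hypothetical helper name.

```shell
# compare_hashes BASELINE CURRENT
# Both files hold sha256sum lines ("<hash>  <path>"). Prints one line per
# difference: CHANGED (hash differs), NEW (only in CURRENT), MISSING (only in BASELINE).
compare_hashes() {
    awk '
        FILENAME == ARGV[1] { base[$2] = $1; next }     # first file: remember baseline
        {
            seen[$2] = 1
            if (!($2 in base))        print "NEW     " $2
            else if (base[$2] != $1)  print "CHANGED " $2
        }
        END {
            for (p in base) if (!(p in seen)) print "MISSING " p
        }
    ' "$1" "$2" | sort
}
```

During IR, re-run the hashing `find` into a new file and call `compare_hashes /root/baseline-bins.txt /tmp/bins-now.txt`. Ideally compare against the off-host copy, since an attacker with root can rewrite the local baseline.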
Harden Linux
SSH
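    # /etc/ssh/sshd_config
    # Install a public key for blue first: PasswordAuthentication no locks out anyone without one
    PermitRootLogin no
    PasswordAuthentication no
    PubkeyAuthentication yes
    PermitEmptyPasswords no
    MaxAuthTries 3
    ClientAliveInterval 300
    ClientAliveCountMax 2
    AllowUsers blue
    LoginGraceTime 30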
    sudo sshd -t                    # validate config
    sudo systemctl restart sshd
fail2ban (SSH protection)
    sudo apt install -y fail2ban
    sudo cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local
    # in jail.local: bantime = 1h, maxretry = 3, [sshd] enabled = true
    sudo systemctl enable --now fail2ban
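If you prefer not to edit a full copy of `jail.conf`, fail2ban also reads override files from `/etc/fail2ban/jail.d/`. The sketch below writes the same three settings to such a file; `write_sshd_jail` is a hypothetical helper, and `findtime = 10m` is an added assumption (it matches fail2ban's usual default).

```shell
# write_sshd_jail PATH
# Writes a minimal override for the sshd jail to PATH.
write_sshd_jail() {
    cat > "$1" <<'EOF'
[sshd]
enabled  = true
maxretry = 3
findtime = 10m
bantime  = 1h
EOF
}
```

Run it as root with `write_sshd_jail /etc/fail2ban/jail.d/sshd.local`, then `fail2ban-client reload` and confirm with `fail2ban-client status sshd`.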
Quick wins
- Disable unused services: `systemctl disable --now <svc>`
- Update packages: `apt update && apt upgrade -y`
- Enable unattended-upgrades for security patches
- Lock all unused user accounts: `passwd -l <user>`
- Set strong umask (027) in `/etc/profile`
- Enable auditd and forward to Splunk
- Install AIDE for file integrity monitoring (`aide --init`, move the generated database into place as the active one, then `aide --check`)
Harden Windows
Local policy quick-wins
- Rename and disable the local Administrator account; create a fresh admin
- Enforce password policy: 14+ chars, complexity, lockout after 5 fails
- Enable Windows Defender + tamper protection; ensure real-time scan is on
- Disable SMBv1: `Disable-WindowsOptionalFeature -Online -FeatureName SMB1Protocol`
- Disable LLMNR + NetBIOS broadcast (mitigate Responder/MITM)
- Enable PowerShell logging: ScriptBlock + Module + Transcription
- Enable Windows Event Forwarding or Splunk UF for Security/System/PowerShell logs
Disable LLMNR (GPO)
Computer Configuration → Administrative Templates
→ Network → DNS Client
→ Turn Off Multicast Name Resolution = Enabled
PowerShell logging
    $base = "HKLM:\Software\Policies\Microsoft\Windows\PowerShell"
    foreach ($key in "ScriptBlockLogging", "ModuleLogging") {
        # Set-ItemProperty fails if the policy key does not exist yet, so create it first
        if (-not (Test-Path "$base\$key")) { New-Item -Path "$base\$key" -Force | Out-Null }
        Set-ItemProperty -Path "$base\$key" -Name "Enable$key" -Value 1
    }
Firewall / Network Device Hardening
UFW (Ubuntu) baseline
    sudo ufw default deny incoming
    sudo ufw default allow outgoing
    sudo ufw allow from 10.0.0.0/24 to any port 22 proto tcp      # SSH from LAN only
    sudo ufw allow 80,443/tcp                                     # web
    # Splunk indexer (10.0.0.40) only: accept forwarder traffic. Forwarders connect outbound and need no inbound rule.
    sudo ufw allow from 10.0.0.0/24 to any port 9997 proto tcp
    sudo ufw enable
    sudo ufw status verbose
Network device principles
- Change default credentials on every router, switch, AP
- Disable Telnet/HTTP · SSH/HTTPS only
- ACLs: deny inbound by default, allow only what's needed
- Egress filter from DC: no internet from domain controllers
- Segment workstations from servers (separate VLANs if possible)
- Log to Splunk via syslog
Credential Rotation
Assume every default password is already burned. Rotate everything in the first hour.
Linux · bulk password reset
    # Generate strong random password
    openssl rand -base64 24
    # Set it, then force password change on next login
    passwd <user>
    chage -d 0 <user>
    # Lock dormant accounts
    usermod -L <user>
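For an actual bulk reset, the per-user commands can be driven from a list. A minimal sketch, assuming a text file with one username per line; `gen_chpasswd_lines` is a hypothetical helper, and it draws randomness from `/dev/urandom` via `base64` rather than `openssl` so it has no extra dependency.

```shell
# gen_chpasswd_lines USERFILE
# Reads one username per line (blank lines and "#" comments are skipped) and
# prints "user:password" lines in the format chpasswd expects.
# Each password is 18 random bytes, base64-encoded (24 characters).
gen_chpasswd_lines() {
    while IFS= read -r user; do
        case "$user" in ''|'#'*) continue ;; esac
        pw=$(head -c 18 /dev/urandom | base64)
        printf '%s:%s\n' "$user" "$pw"
    done < "$1"
}
```

Usage: `umask 077; gen_chpasswd_lines users.txt > new-creds.txt`, then `sudo chpasswd < new-creds.txt`. Move `new-creds.txt` into the team's password store and delete it from the host afterwards.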
Windows / AD
    # Force password reset for all enabled users
    Get-ADUser -Filter {Enabled -eq $true} | Set-ADUser -ChangePasswordAtLogon $true
    # Reset specific account with random password
    Add-Type -AssemblyName System.Web    # GeneratePassword lives in System.Web (Windows PowerShell 5.1 / .NET Framework)
    $pw = ConvertTo-SecureString ([System.Web.Security.Membership]::GeneratePassword(20,5)) -AsPlainText -Force
    Set-ADAccountPassword -Identity <user> -NewPassword $pw -Reset
Splunk Setup
The full Splunk Enterprise install + Universal Forwarder setup for Linux and Windows clients is on the Monitoring Guide page.
Quick start
Server install (single .deb), forwarders on every host, port 9997 for receive, port 8000 for the web UI.
Open Monitoring Guide
Forwarders
UF on Linux uses /opt/splunkforwarder; on Windows the MSI wizard with a receiving indexer pre-set.
Forwarder setup
Splunk Dashboards
Pre-baked SPL queries for alerts and dashboards across Linux, Windows, and server services live on the Dashboards & Alerts page.
- Build the "Live Ops" dashboard first: failed logins timechart, top source IPs, host heartbeats
- Then build alerts (Save As → Alert) for the HIGH severity queries
- Set dashboard auto-refresh to 5 min · keeps it live
- Pin a "Last 24h log volume per host" panel · silent hosts = forwarder problem
Backups
Snapshot every important host before go-live. If the red team trashes a box, you restore from the snapshot, not from a panicked Google search.
VM snapshots
    # VirtualBox
    VBoxManage snapshot <vmname> take "pre-ctf-baseline" --description "clean state"
    # VMware (per-VM)
    vmrun snapshot /path/to/vm.vmx "pre-ctf-baseline"
Application data
- MySQL/PostgreSQL: nightly `mysqldump --all-databases | gzip` shipped off-host (use `pg_dumpall` for PostgreSQL)
- Web roots: `tar -czf /backup/web-$(date +%F).tgz /var/www`
- AD: `wbadmin start systemstatebackup -backupTarget:<volume>` on the DC
- Splunk indexes: stop Splunk → tar /opt/splunk/var/lib/splunk → restart
- Pull all backups to a separate "blue" host, not on the production net
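A backup is only useful if you can prove it is intact at restore time. The sketch below wraps the `tar` step with a checksum file; `backup_dir` is a hypothetical helper name, and it assumes GNU `sha256sum` is available.

```shell
# backup_dir SRC DEST_DIR
# Archives SRC into DEST_DIR/<basename>-<timestamp>.tgz, writes a .sha256 file
# next to it, and prints the archive path.
backup_dir() {
    src=$1
    dest=$2
    name=$(basename "$src")-$(date +%F-%H%M%S).tgz
    mkdir -p "$dest" || return 1
    # -C keeps the paths inside the archive relative to the parent of SRC
    tar -czf "$dest/$name" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    ( cd "$dest" && sha256sum "$name" > "$name.sha256" ) || return 1
    printf '%s\n' "$dest/$name"
}
```

Usage: `backup_dir /var/www /backup`. After copying both files to the blue host, run `sha256sum -c <archive>.sha256` there to confirm the transfer.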
Monitoring Playbook
What you do every hour as the Watcher. Discipline beats genius · keep the rotation.
Hourly rotation (all panels)
- Failed logins by source IP · anything over 50 in 5 min = brute force.
- Successful logins of Domain Admins · every one of these is investigated.
- New processes per host · anything unusual (powershell.exe spawning cmd, wmic, vssadmin) gets a ticket.
- Outbound network from servers · DC should have zero egress traffic.
- Host heartbeats · a silent host means a dead UF or a wiped log channel.
- HTTP 5xx / web errors · sudden spike = exploitation attempt.
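- Privileged Windows events · 4720 (account created), 4732 (member added to a local group), 4740 (account locked out), 4624 type 10 (RDP logon), 4769 with RC4 encryption (possible Kerberoasting).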
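When Splunk is unavailable, the brute-force threshold can be checked straight from a log file. A rough sketch, assuming the traditional syslog timestamp format (`Mar  3 14:23:01 host sshd[...]`) in `/var/log/auth.log`. It counts in fixed 5-minute buckets rather than a sliding window, so a burst straddling a bucket boundary may be undercounted. `brute_force_buckets` is a hypothetical name.

```shell
# brute_force_buckets LOGFILE [THRESHOLD]
# Prints "<count> <month> <day> <HH:MM bucket> <source-ip>" for every
# source IP with more than THRESHOLD (default 50) failed passwords in a bucket.
brute_force_buckets() {
    awk -v limit="${2:-50}" '
        /Failed password/ {
            ip = ""
            # The source IP is the word right after the last "from"
            for (i = 1; i < NF; i++) if ($i == "from") ip = $(i + 1)
            if (ip == "") next
            split($3, t, ":")                           # $3 is HH:MM:SS
            bucket = sprintf("%s %s %s:%02d", $1, $2, t[1], int(t[2] / 5) * 5)
            count[bucket " " ip]++
        }
        END {
            for (k in count) if (count[k] > limit) print count[k], k
        }
    ' "$1" | sort -rn
}
```

Usage: `brute_force_buckets /var/log/auth.log` or, with a lower threshold during quiet periods, `brute_force_buckets /var/log/auth.log 20`.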
Journal entry template
[14:23] FAILED-LOGIN-SPIKE
  src_ip: 10.0.0.66 (workstation05)
  target: db01:22 · 47 fails in 4 min
  action: blocked src_ip on db01 ufw, opened ticket
  status: contained, watching for further attempts from same /24
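Entries get skipped when writing them is tedious, so a one-line helper can lower the friction. A minimal sketch with a simplified field set (host, action, status); `journal_add` and the field layout are illustrative, not a fixed format.

```shell
# journal_add FILE TAG HOST ACTION STATUS
# Appends a timestamped entry to the running journal.
journal_add() {
    {
        printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M')" "$2"
        printf '  host: %s\n' "$3"
        printf '  action: %s\n' "$4"
        printf '  status: %s\n' "$5"
    } >> "$1"
}
```

Usage: `journal_add journal.txt FAILED-LOGIN-SPIKE db01 "blocked 10.0.0.66 in ufw" contained`.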
Triage & Escalation
| Sev | Examples | Action | SLA |
|---|---|---|---|
| P1 | DA login from unknown IP, krbtgt activity, ntds.dit access | Wake reserve, full team active, isolate DC | 0 min |
| P2 | Web shell, new admin, lateral SMB | Responder takes lead, contain host | 5 min |
| P3 | Brute force, recon, scan noise | Block source, log, monitor | 15 min |
| P4 | Single failed login, low-rate noise | Note in journal, ignore | · |
IR: Linux Compromise
You suspect a Linux host is owned. Run this checklist before rebooting.
    # Active connections + listening sockets
    ss -tnp; ss -tlnp
    # Suspicious processes
    ps auxf
    ps -ef --forest
    # Recently modified files (last 24h)
    find / -mtime -1 -type f -not -path "/proc/*" -not -path "/sys/*" 2>/dev/null
    # Logged-in users + history
    w; last -i | head -30
    cat /home/*/.bash_history /root/.bash_history 2>/dev/null
    # SUID/SGID newly added (needs a /root/baseline-suid.txt captured with the same find during prep)
    find / -perm /6000 -type f 2>/dev/null > /tmp/suid-now.txt
    diff /root/baseline-suid.txt /tmp/suid-now.txt
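For the SUID comparison, usually only the additions matter. The sketch below prints just the paths present now but absent from the baseline, using `comm` on sorted copies; `new_suid_files` is a hypothetical helper name.

```shell
# new_suid_files BASELINE CURRENT
# Prints paths that appear in CURRENT but not in BASELINE, one per line.
new_suid_files() {
    b=$(mktemp) && c=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$b"
    LC_ALL=C sort -u "$2" > "$c"
    LC_ALL=C comm -13 "$b" "$c"     # -13 keeps only lines unique to the second file
    rm -f "$b" "$c"
}
```

Usage: `new_suid_files /root/baseline-suid.txt /tmp/suid-now.txt`. Any output at all deserves a look, especially paths under `/tmp`, `/dev/shm`, or a home directory.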
Containment options (least destructive first)
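- Network isolation · set the firewall to deny all traffic except Splunk + management host
- Kill the suspicious process (preserve memory dump first if possible)
- Disable the compromised account: `passwd -l <user>` (this locks the password only; SSH key logins still work, so also expire the account with `usermod --expiredate 1 <user>`)
- Snapshot the host (if VM) before any further changes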
- Restore from baseline snapshot as last resort
IR: Windows Compromise
    # Active connections
    Get-NetTCPConnection | Where State -eq Established | Sort RemoteAddress
    # Running processes (with command line); -Wrap keeps long command lines from being truncated
    Get-CimInstance Win32_Process | Select Name, ProcessId, ParentProcessId, CommandLine | ft -AutoSize -Wrap
    # Newly created services
    Get-WinEvent -FilterHashtable @{LogName='System'; Id=7045} -MaxEvents 50
    # Recent logins (Event 4624)
    Get-WinEvent -FilterHashtable @{LogName='Security'; Id=4624; StartTime=(Get-Date).AddHours(-2)}
    # Scheduled tasks added recently (Date is a string and often empty, so cast before comparing)
    Get-ScheduledTask | Where { $_.Date -and [datetime]$_.Date -gt (Get-Date).AddDays(-1) }
IR: Find Persistence Mechanisms
Common places attackers hide to survive reboots.
Linux
- cron: `crontab -l; ls -la /etc/cron.*; cat /etc/anacrontab` (`crontab -l` shows only the current user; per-user crontabs live in `/var/spool/cron/crontabs` on Debian/Ubuntu)
- systemd: `systemctl list-unit-files --state=enabled`
- SSH keys: check every user's `~/.ssh/authorized_keys`
- Shell init: `~/.bashrc`, `~/.bash_profile`, `/etc/profile.d/`
- Loadable kernel modules: `lsmod; cat /etc/modules-load.d/*`
- SUID binaries (diff against baseline)
- rc.local, systemd timers, motd scripts
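Checking every user's `authorized_keys` by hand is slow on a host with many accounts. A minimal sketch that lists key type and comment per file, assuming home directories under `/home` plus `/root`; `list_authorized_keys` is a hypothetical name, and the key-type match is a heuristic that could misfire on unusual option strings.

```shell
# list_authorized_keys [ROOT]
# Prints "<file> <key-type> <comment>" for every key in every authorized_keys
# file under ROOT/home/* and ROOT/root. ROOT defaults to the live filesystem.
list_authorized_keys() {
    root=${1:-}
    for f in "$root"/home/*/.ssh/authorized_keys "$root"/root/.ssh/authorized_keys; do
        [ -f "$f" ] || continue
        awk -v file="$f" '
            /^[ \t]*(#|$)/ { next }                     # skip comments and blank lines
            {
                # Options may precede the key type, so locate the type field first
                for (i = 1; i <= NF; i++)
                    if ($i ~ /^(ssh-|ecdsa-|sk-)/) {
                        comment = (i + 2 <= NF) ? $(i + 2) : "(no comment)"
                        print file, $i, comment
                        break
                    }
            }
        ' "$f"
    done
}
```

Run `sudo sh -c '. ./helpers.sh; list_authorized_keys'` on the live host, or pass a mount point such as `/mnt/evidence` to inspect a mounted disk image. Any key whose comment nobody on the team recognizes is a persistence candidate.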
Windows
- Run keys: `HKLM\Software\Microsoft\Windows\CurrentVersion\Run` (and HKCU)
- Scheduled tasks: `schtasks /query /v`
- Services: `Get-Service | Where StartType -eq Automatic`
- WMI event subscriptions: `Get-WMIObject -Namespace root\subscription -Class __FilterToConsumerBinding`
- Startup folder: `shell:startup`, `shell:common startup`
- Image File Execution Options (debugger hijack)
- BITS jobs: `bitsadmin /list /allusers /verbose`
IR: Lateral Movement
Spotting the attacker hopping host-to-host.
Indicators
- SMB: Windows Event 5140 admin share access (\\host\C$, ADMIN$)
- PsExec: service named PSEXESVC (Event 7045), or process PSEXESVC.exe
- WMI remote: WmiPrvSE.exe spawning unusual children
- RDP: Event 4624 with Logon_Type=10, especially from non-admin workstation
- SSH lateral (Linux): unusual user logins between hosts in `/var/log/auth.log`
- Pass-the-Hash: Event 4624 Logon_Type=3 with NTLM auth from unexpected source
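For the SSH indicator, a per-host summary of successful logins makes odd user/source pairs stand out. A minimal sketch, assuming OpenSSH's standard `Accepted <method> for <user> from <ip>` log lines; `ssh_accepted_logins` is a hypothetical helper name.

```shell
# ssh_accepted_logins LOGFILE
# Prints "<count> <user> <source-ip> <auth-method>", most frequent first.
ssh_accepted_logins() {
    awk '
        / sshd\[[0-9]+\]: Accepted / {
            for (i = 1; i < NF; i++) {
                if ($i == "Accepted") method = $(i + 1)
                if ($i == "for")      user   = $(i + 1)
                if ($i == "from")     ip     = $(i + 1)
            }
            seen[user " " ip " " method]++
        }
        END { for (k in seen) print seen[k], k }
    ' "$1" | sort -k1,1nr -k2
}
```

Usage: `ssh_accepted_logins /var/log/auth.log`. A service account such as `www-data` logging in at all, or a password login on a host that should be key-only, is worth a ticket.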
Containment
- Block source host at the firewall · kill its outbound to internal targets
- Kill active sessions: `logoff <sessionid>` on Windows, kill SSH PIDs on Linux
- Reset credentials of every account that touched the source host
- Audit the destination hosts for new accounts, services, scheduled tasks
Common Attack Patterns
SSH brute force
High failed-login count from one IP. Block IP, ensure key-only auth, fail2ban active.
Web shell upload
Look for new .php, .aspx, .jsp files in web roots; outbound from www-data; POST with body containing cmd=.
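The file check can be scripted against a marker file. A minimal sketch, assuming you `touch` a marker (for example `/root/baseline-marker`, a hypothetical path) when the baseline is taken; `new_web_scripts` is likewise an illustrative name.

```shell
# new_web_scripts WEBROOT REFERENCE_FILE
# Lists .php/.aspx/.jsp files under WEBROOT modified more recently than
# REFERENCE_FILE.
new_web_scripts() {
    find "$1" -type f \( -name '*.php' -o -name '*.aspx' -o -name '*.jsp' \) \
        -newer "$2" 2>/dev/null | sort
}
```

Usage: `new_web_scripts /var/www /root/baseline-marker`. Modification times can be forged, so treat an empty result as weak evidence and fall back to the hash baseline when in doubt.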
Credential dumping
Sysmon Event 10 ProcessAccess on lsass.exe. Mimikatz/Rubeus signatures in command line.
Kerberoasting
Single user requesting many service tickets (Event 4769, RC4 encryption flag).
Pass-the-Hash
Event 4624 Logon_Type=3 + NTLM authentication from a non-DC source. Block the source host from reaching other internal hosts.
Persistence via scheduled task
New schtask running unsigned binary or PowerShell -enc. Disable, investigate parent.
Splunk Query Cheatsheet
    # All events from one host in last hour
    index=* host=db01 earliest=-1h
    # Failed SSH logins by IP
    index=main sourcetype=linux_secure "Failed password" | stats count by src_ip, user | sort -count
    # Windows logon failures
    index=wineventlog EventCode=4625 | stats count by Account_Name, src_ip
    # Top sourcetypes / volume
    index=* earliest=-1h | stats count by host, sourcetype | sort -count
    # Live process creation (requires 4688 auditing with command-line logging enabled)
    index=wineventlog EventCode=4688 New_Process_Name="*powershell*" | table _time, ComputerName, Process_Command_Line
    # Silent hosts (no logs in last 30 min)
    | metadata type=hosts | eval mins_silent=round((now()-recentTime)/60,1) | where mins_silent > 30 | sort -mins_silent
For more, see the dedicated Dashboards & Alerts page.
Critical Ports Reference
| Port | Proto | Service | Notes |
|---|---|---|---|
| 22 | TCP | SSH | Restrict to mgmt subnet |
| 53 | UDP/TCP | DNS | Watch for tunneling |
| 88 | TCP/UDP | Kerberos | DC only |
| 135 | TCP | RPC endpoint mapper | Windows only, internal |
| 389/636 | TCP | LDAP / LDAPS | DC only |
| 445 | TCP | SMB | Block between workstations |
| 3389 | TCP | RDP | Bastion only, never internet-exposed |
| 5985/5986 | TCP | WinRM | Internal mgmt only |
| 8000 | TCP | Splunk Web UI | Internal access only |
| 9997 | TCP | Splunk receive | From forwarders to indexer |
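The table doubles as an allowlist: anything listening outside the ports a host's role calls for is suspect. A minimal sketch that filters `ss -tlnH` output (State, Recv-Q, Send-Q, Local Address:Port, ...) against a list of allowed ports; `unexpected_listeners` is a hypothetical helper name.

```shell
# unexpected_listeners ALLOWED_PORTS
# Reads `ss -tlnH` output on stdin and prints every listening TCP port that is
# not in the space-separated ALLOWED_PORTS list.
unexpected_listeners() {
    awk -v allowed="$1" '
        BEGIN { n = split(allowed, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
        {
            port = $4
            sub(/.*:/, "", port)                # keep what follows the last colon
            if (!(port in ok) && !(port in shown)) { shown[port] = 1; print port }
        }
    ' | sort -n
}
```

Usage on a web server: `ss -tlnH | unexpected_listeners "22 80 443"`. Follow up on each reported port with `ss -tlnp` to see which process owns it.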
Log File Locations
Linux
| Path | Content |
|---|---|
| /var/log/auth.log | SSH, sudo, login |
| /var/log/syslog | General system events |
| /var/log/kern.log | Kernel messages |
| /var/log/apache2/access.log | Apache requests |
| /var/log/nginx/access.log | Nginx requests |
| /var/log/mysql/error.log | MySQL errors |
| ~/.bash_history | Per-user shell history |
| journalctl -u <svc> | systemd unit logs |
Windows
| Channel | Content |
|---|---|
| Security | Logons, account changes, audit (4624, 4625, 4720, 4732, 4740, 4769) |
| System | Service events (7045 = new service) |
| Application | App-level errors |
| Microsoft-Windows-PowerShell/Operational | 4103, 4104 PowerShell logs |
| Microsoft-Windows-Sysmon/Operational | 1/3/8/10/22 etc. (if Sysmon installed) |
Useful Tools
Sysmon (Windows)
Endpoint visibility on steroids · process create, network connect, image load, remote thread, registry. Use SwiftOnSecurity config as the baseline.
SwiftOnSecurity config
auditd (Linux)
Kernel-level audit framework. Use auditctl to track file accesses, syscalls, and command execution.
chkrootkit / rkhunter
Rootkit detectors. Run during prep for a baseline; re-run during IR to spot kernel-level implants.
AIDE
File integrity checker. Compute baseline hashes, then `aide --check` later to find tampered files.
fail2ban
Watches log files and bans IPs that hit thresholds. Easy SSH/web brute force defense.
tcpdump / Wireshark
For packet capture during incidents. `tcpdump -i any -w incident.pcap host <ip>` for a quick capture.
Shift Handoff Template
Use this every shift change. Verbal walkthrough + journal entry.
SHIFT HANDOFF · <date> <time>
Outgoing: P1
Incoming: P2
ACTIVE INCIDENTS
1. <short title> · sev P2 · host: db01 · owner: P1
status: contained, monitoring egress
next-action: re-image at 14:00 if no further activity
OPEN TICKETS
- #007 SSH bruteforce 10.0.0.66 → blocked, watching
- #008 unusual cron on web01 → escalate to P3 if reappears
ENVIRONMENT CHANGES THIS SHIFT
- rotated credentials on dc01 (krbtgt 1st reset done, 2nd at 19:00)
- added 2 new alerts in Splunk (failed PowerShell -enc, new service)
WATCHLIST FOR INCOMING SHIFT
- krbtgt 2nd reset at 19:00 · must run before red team can reuse golden ticket
- Splunk disk at 78% · rotate older indexes if needed
UNCERTAIN / TO-INVESTIGATE
- one alert at 12:34 looked like recon from 10.0.0.99 · couldn't reach owner
- need second pair of eyes on web01 access.log for SQLi patterns