In the vast digital realm of Jekyll-powered GitHub Pages, where content flows like rivers of markdown and links connect distant territories of knowledge, a silent corruption threatens the very foundation of your domain. Broken hyperlinks - those severed pathways between digital realms - can transform a magnificent knowledge fortress into a maze of frustration for visiting adventurers.
The ancient DevOps masters speak of a legendary guardian system: an intelligent sentinel that tirelessly patrols every corner of your digital domain, testing each hyperlink's integrity and summoning AI-powered analysis to root out the causes of corruption. This automated guardian not only detects the broken pathways but uses artificial intelligence to understand why they failed and how to prevent future breaks.
In the modern era of digital content creation, maintaining link integrity across hundreds or thousands of pages becomes an insurmountable challenge for mortal developers. Every external site change, every moved resource, every deprecated API endpoint creates potential breaks in your carefully crafted knowledge network. The traditional approach of manual link checking scales poorly and often catches problems too late - after visitors have already encountered the digital equivalent of collapsed bridges.
The quest you're about to embark upon will teach you to harness the combined power of GitHub Actions automation and artificial intelligence to create a proactive defense system. Your hyperlink guardian will not merely detect broken links; it will analyze patterns, identify root causes, and provide actionable intelligence to strengthen your digital domain's infrastructure.
By the time you complete this epic automation journey, you will have mastered:
You'll know you've truly mastered this quest when you can:
Different platforms offer unique advantages for this DevOps automation quest. The core workflow runs on GitHub's cloud infrastructure, but your development environment affects how you'll build and test the components.
# Install required tools using Homebrew
brew install node curl jq  # npm ships with node, so no separate formula is needed
npm install -g markdown-link-check
Perfect for developers who prefer the Unix-like environment with excellent Ruby/Jekyll support
# Install via Chocolatey or manual installation
choco install nodejs curl jq
npm install -g markdown-link-check
Windows Subsystem for Linux (WSL) provides an excellent alternative for Unix-like tools
# Ubuntu/Debian installation
sudo apt update
sudo apt install nodejs npm curl jq
npm install -g markdown-link-check
# CentOS/RHEL/Fedora installation
sudo dnf install nodejs npm curl jq
npm install -g markdown-link-check
Native environment for most CI/CD tools with excellent performance
The GitHub Actions environment provides all necessary tools out of the box. You can also develop and test using GitHub Codespaces or other cloud IDEs.
The first enchantment we'll craft is a powerful script that can discover every hyperlink hidden throughout your Jekyll domain. This isn't merely about finding obvious markdown links - we need to detect links in HTML, frontmatter, data files, and even dynamically generated content.
#!/bin/bash
# hyperlink-guardian.sh - The core link detection and testing script
set -euo pipefail
# Configuration variables (can be overridden by environment)
SITE_URL="${SITE_URL:-https://bamr87.github.io/it-journey}"
OUTPUT_DIR="${OUTPUT_DIR:-./link-check-results}"
MAX_PARALLEL="${MAX_PARALLEL:-10}"
TIMEOUT="${TIMEOUT:-30}"
# Create output directory
mkdir -p "$OUTPUT_DIR"
echo "Hyperlink Guardian: Beginning domain scan..."
echo "Target domain: $SITE_URL"
echo "Timestamp: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
# Function to extract all markdown files
find_markdown_files() {
find . -name "*.md" -o -name "*.markdown" | grep -v node_modules | grep -v .git
}
# Function to extract all HTML files (including generated Jekyll output)
find_html_files() {
find . -name "*.html" | grep -v node_modules | grep -v .git
}
# Function to extract links from markdown files
extract_markdown_links() {
local file="$1"
# Extract markdown links [text](url) and reference links
grep -oE '\[([^\]]*)\]\(([^)]+)\)' "$file" | sed 's/.*(\([^)]*\)).*/\1/' || true
# Extract reference-style links [text]: url
grep -oE '^\[([^\]]*)\]:\s*(.+)$' "$file" | sed 's/.*:\s*\(.*\)/\1/' || true
}
# Function to extract links from HTML files
extract_html_links() {
local file="$1"
# Extract href attributes
grep -oE 'href="([^"]*)"' "$file" | sed 's/href="//;s/"//' || true
# Extract src attributes (images, scripts)
grep -oE 'src="([^"]*)"' "$file" | sed 's/src="//;s/"//' || true
}
# Function to normalize and filter URLs
normalize_url() {
local url="$1"
# Skip empty URLs, anchors, and mailto links
if [[ -z "$url" || "$url" =~ ^# || "$url" =~ ^mailto: ]]; then
return 1
fi
# Convert relative URLs to absolute
if [[ "$url" =~ ^/ ]]; then
echo "${SITE_URL}${url}"
elif [[ "$url" =~ ^http ]]; then
echo "$url"
else
return 1 # Skip relative paths for now
fi
}
# Function to test a single URL
test_url() {
local url="$1"
local output_file="$2"
local status_code
local response_time
local error_message=""
# Use curl to test the URL
if response=$(curl -s -o /dev/null -w "%{http_code}|%{time_total}" \
--max-time "$TIMEOUT" \
--user-agent "IT-Journey-Hyperlink-Guardian/1.0" \
"$url" 2>&1); then
status_code=$(echo "$response" | cut -d'|' -f1)
response_time=$(echo "$response" | cut -d'|' -f2)
else
status_code="ERROR"
response_time="0"
error_message="$response"
fi
# Determine if link is broken
local status="PASS"
if [[ "$status_code" == "ERROR" ]] || [[ "$status_code" -ge 400 ]]; then
status="FAIL"
fi
# Output result in structured format
echo "$(date -u +"%Y-%m-%d %H:%M:%S")|$url|$status_code|$response_time|$status|$error_message" >> "$output_file"
if [[ "$status" == "FAIL" ]]; then
echo "❌ BROKEN: $url (Status: $status_code)"
else
echo "✅ OK: $url (Status: $status_code, ${response_time}s)"
fi
}
# Main scanning function
scan_site_links() {
local all_links_file="$OUTPUT_DIR/all_links.txt"
local unique_links_file="$OUTPUT_DIR/unique_links.txt"
local results_file="$OUTPUT_DIR/test_results.csv"
echo "Extracting links from markdown files..."
> "$all_links_file" # Clear file
while IFS= read -r file; do
echo "Scanning: $file"
extract_markdown_links "$file" >> "$all_links_file"
done < <(find_markdown_files)
echo "Extracting links from HTML files..."
while IFS= read -r file; do
echo "Scanning: $file"
extract_html_links "$file" >> "$all_links_file"
done < <(find_html_files)
echo "Normalizing and deduplicating URLs..."
> "$unique_links_file" # Clear file
while IFS= read -r url; do
if normalized_url=$(normalize_url "$url"); then
echo "$normalized_url" >> "$unique_links_file"
fi
done < "$all_links_file"
sort "$unique_links_file" | uniq > "$unique_links_file.tmp"
mv "$unique_links_file.tmp" "$unique_links_file"
local total_links
total_links=$(wc -l < "$unique_links_file")
echo "Found $total_links unique links to test"
# Initialize results file with header
echo "timestamp|url|status_code|response_time|status|error_message" > "$results_file"
echo "Testing links (max $MAX_PARALLEL parallel)..."
# Test URLs in parallel batches
local count=0
while IFS= read -r url; do
count=$((count + 1))  # avoid ((count++)), which exits under set -e when count is 0
echo "[$count/$total_links] Testing: $url"
# Run in background for parallel processing
test_url "$url" "$results_file" &
# Limit parallel processes
if (( count % MAX_PARALLEL == 0 )); then
wait # Wait for current batch to complete
fi
done < "$unique_links_file"
wait # Wait for any remaining background processes
echo "✅ Link testing complete! Results saved to $results_file"
}
# Function to generate summary statistics
generate_summary() {
local results_file="$OUTPUT_DIR/test_results.csv"
local summary_file="$OUTPUT_DIR/summary.json"
if [[ ! -f "$results_file" ]]; then
echo "❌ Results file not found: $results_file"
return 1
fi
# Skip header line and calculate statistics
local total_links
local broken_links
local working_links
total_links=$(tail -n +2 "$results_file" | wc -l)
broken_links=$(tail -n +2 "$results_file" | grep -c "|FAIL|" || true)
working_links=$((total_links - broken_links))
# Create JSON summary for AI analysis
cat > "$summary_file" << EOF
{
"scan_timestamp": "$(date -u +"%Y-%m-%d %H:%M:%S UTC")",
"site_url": "$SITE_URL",
"total_links": $total_links,
"working_links": $working_links,
"broken_links": $broken_links,
"success_rate": $(echo "scale=2; $working_links * 100 / $total_links" | bc -l 2>/dev/null || echo "0"),
"broken_link_details": [
EOF
# Add broken link details
local first=true
while IFS='|' read -r timestamp url status_code response_time status error_message; do
if [[ "$status" == "FAIL" ]]; then
if [[ "$first" == "true" ]]; then
first=false
else
echo "," >> "$summary_file"
fi
cat >> "$summary_file" << EOF
{
"url": "$url",
"status_code": "$status_code",
"error_message": "$error_message",
"timestamp": "$timestamp"
}
EOF
fi
done < <(tail -n +2 "$results_file")
cat >> "$summary_file" << EOF
]
}
EOF
echo "Summary generated: $summary_file"
echo "Statistics:"
echo " Total Links: $total_links"
echo " Working: $working_links"
echo " Broken: $broken_links"
echo " Success Rate: $(echo "scale=1; $working_links * 100 / $total_links" | bc -l 2>/dev/null || echo "0")%"
}
# Main execution
main() {
scan_site_links
generate_summary
echo "Hyperlink Guardian scan complete!"
echo "Results available in: $OUTPUT_DIR"
}
# Run main function if script is executed directly
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
main "$@"
fi
You've successfully created a powerful link detection spell! Test it locally with a small subset of your site before deploying the full automation.
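A quick local smoke test might look like the sketch below. The site URL, output directory, and parallelism values are only examples; point them at your own domain and keep the numbers small for a first run.

```bash
# Hypothetical local smoke test -- adjust SITE_URL and the limits for your own site
chmod +x hyperlink-guardian.sh

SITE_URL="https://bamr87.github.io/it-journey" \
OUTPUT_DIR="./link-check-results" \
MAX_PARALLEL=5 \
TIMEOUT=15 \
./hyperlink-guardian.sh

# Inspect the structured summary the script produces
jq '.' ./link-check-results/summary.json
```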
Now we'll weave the automation spell that transforms your link testing script into a tireless guardian that watches over your digital realm. This GitHub Actions workflow will run daily, execute comprehensive scans, and prepare data for AI analysis.
# .github/workflows/hyperlink-guardian.yml
name: Hyperlink Guardian - Daily Link Health Check
on:
schedule:
# Run every day at 3:00 AM UTC (adjust timezone as needed)
- cron: '0 3 * * *'
workflow_dispatch: # Allow manual triggering
inputs:
force_scan:
description: 'Force full site scan even if no changes detected'
required: false
default: 'false'
type: boolean
env:
SITE_URL: 'https://bamr87.github.io/it-journey'  # replace with your own site URL
OUTPUT_DIR: './link-check-results'
jobs:
link-health-scan:
name: Scan Link Health
runs-on: ubuntu-latest
permissions:
contents: read
issues: write
steps:
- name: Checkout Repository
uses: actions/checkout@v4
with:
fetch-depth: 0 # Full history for change detection
- name: Setup Node.js Environment
uses: actions/setup-node@v4
with:
node-version: '18'
cache: 'npm'
- name: Install Dependencies
run: |
npm install -g markdown-link-check
sudo apt-get update
sudo apt-get install -y curl jq bc
- name: Prepare Hyperlink Guardian Script
run: |
# Create the link checking script
cat > hyperlink-guardian.sh << 'SCRIPT_EOF'
#!/bin/bash
set -euo pipefail
# Configuration
SITE_URL="${SITE_URL:-https://bamr87.github.io/it-journey}"
OUTPUT_DIR="${OUTPUT_DIR:-./link-check-results}"
MAX_PARALLEL="${MAX_PARALLEL:-10}"
TIMEOUT="${TIMEOUT:-30}"
mkdir -p "$OUTPUT_DIR"
echo "Hyperlink Guardian: Beginning domain scan..."
echo "Target domain: $SITE_URL"
echo "Timestamp: $(date -u +"%Y-%m-%d %H:%M:%S UTC")"
# Function definitions from Chapter 1 script...
# [Include all the functions from the previous script here]
# Main execution
scan_site_links() {
# Implement the scanning logic here
# This is a simplified version - use the full implementation from Chapter 1
local results_file="$OUTPUT_DIR/test_results.csv"
echo "timestamp|url|status_code|response_time|status|error_message" > "$results_file"
# Find all markdown files and extract links
find . -name "*.md" -o -name "*.markdown" | grep -v node_modules | grep -v .git | while read file; do
echo "Scanning: $file"
# Extract and test links (simplified for brevity)
grep -oE '\[([^\]]*)\]\(([^)]+)\)' "$file" | sed 's/.*(\([^)]*\)).*/\1/' | while read url; do
if [[ "$url" =~ ^http ]]; then
status_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time "$TIMEOUT" "$url" || echo "ERROR")
status="PASS"
if [[ "$status_code" == "ERROR" ]] || [[ "$status_code" -ge 400 ]]; then
status="FAIL"
fi
echo "$(date -u +"%Y-%m-%d %H:%M:%S")|$url|$status_code|0|$status|" >> "$results_file"
fi
done
done
}
generate_summary() {
local results_file="$OUTPUT_DIR/test_results.csv"
local summary_file="$OUTPUT_DIR/summary.json"
local total_links=$(tail -n +2 "$results_file" | wc -l)
local broken_links=$(tail -n +2 "$results_file" | grep -c "|FAIL|" || true)
local working_links=$((total_links - broken_links))
cat > "$summary_file" << EOF
{
"scan_timestamp": "$(date -u +"%Y-%m-%d %H:%M:%S UTC")",
"site_url": "$SITE_URL",
"total_links": $total_links,
"working_links": $working_links,
"broken_links": $broken_links,
"success_rate": $(echo "scale=2; $working_links * 100 / $total_links" | bc -l 2>/dev/null || echo "0"),
"repository": "$GITHUB_REPOSITORY",
"commit_sha": "$GITHUB_SHA",
"broken_link_details": []
}
EOF
}
main() {
scan_site_links
generate_summary
echo "Hyperlink Guardian scan complete!"
}
main "$@"
SCRIPT_EOF
chmod +x hyperlink-guardian.sh
- name: Execute Hyperlink Guardian Scan
run: |
echo "Starting comprehensive link health check..."
./hyperlink-guardian.sh
echo "Scan Results:"
if [[ -f "$OUTPUT_DIR/summary.json" ]]; then
cat "$OUTPUT_DIR/summary.json" | jq '.'
fi
- name: Upload Scan Results as Artifacts
uses: actions/upload-artifact@v4
with:
name: link-health-results-${{ github.run_number }}
path: |
${{ env.OUTPUT_DIR }}/
retention-days: 30
- name: Prepare AI Analysis Data
id: prepare-analysis
run: |
# Create a comprehensive data package for AI analysis
ANALYSIS_DIR="./ai-analysis-input"
mkdir -p "$ANALYSIS_DIR"
# Copy scan results
cp -r "$OUTPUT_DIR"/* "$ANALYSIS_DIR/"
# Add repository context
cat > "$ANALYSIS_DIR/repository_context.json" << EOF
{
"repository": "$GITHUB_REPOSITORY",
"branch": "$GITHUB_REF_NAME",
"commit_sha": "$GITHUB_SHA",
"workflow_run_id": "$GITHUB_RUN_ID",
"trigger": "$GITHUB_EVENT_NAME",
"site_url": "$SITE_URL"
}
EOF
# Add recent commit history for context
git log --oneline -10 > "$ANALYSIS_DIR/recent_commits.txt"
# Check if there are broken links
BROKEN_COUNT=$(jq -r '.broken_links' "$OUTPUT_DIR/summary.json" 2>/dev/null || echo "0")
echo "broken_count=$BROKEN_COUNT" >> $GITHUB_OUTPUT
# Create analysis prompt
cat > "$ANALYSIS_DIR/analysis_prompt.txt" << EOF
Please analyze the hyperlink health scan results for the IT-Journey repository.
Context:
- This is a Jekyll-based GitHub Pages educational site
- The site contains technical documentation, tutorials, and learning quests
- Links may be to external documentation, GitHub repositories, tools, or internal content
Analysis Requirements:
1. Summarize the overall link health status
2. Categorize broken links by type (external sites, GitHub repos, documentation, etc.)
3. Identify patterns in link failures (specific domains, types of content, etc.)
4. Provide root cause analysis for common failure types
5. Suggest specific remediation actions for each broken link
6. Recommend preventive measures to avoid future link rot
7. Assess the impact on user experience and learning outcomes
Please provide actionable insights that help maintain the educational value of this learning platform.
EOF
echo "AI analysis data prepared in $ANALYSIS_DIR"
ls -la "$ANALYSIS_DIR"
outputs:
broken_count: ${{ steps.prepare-analysis.outputs.broken_count }}
ai-analysis:
name: AI-Powered Link Analysis
needs: link-health-scan
runs-on: ubuntu-latest
if: needs.link-health-scan.outputs.broken_count > 0
permissions:
contents: read
issues: write
steps:
- name: Checkout Repository
uses: actions/checkout@v4
- name: Download Scan Results
uses: actions/download-artifact@v4
with:
name: link-health-results-${{ github.run_number }}
path: ./analysis-input
- name: Execute AI Analysis
id: ai-analysis
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
# Create AI analysis script
cat > ai_analyzer.py << 'PYTHON_EOF'
import json
import os
import sys
from datetime import datetime
try:
import openai
except ImportError:
print("Installing OpenAI library...")
os.system("pip install openai")
import openai
def analyze_link_health():
# Load scan results
with open('./analysis-input/summary.json', 'r') as f:
summary = json.load(f)
# Load repository context
with open('./analysis-input/repository_context.json', 'r') as f:
repo_context = json.load(f)
# Load analysis prompt
with open('./analysis-input/analysis_prompt.txt', 'r') as f:
base_prompt = f.read()
# Prepare data for AI analysis
analysis_data = {
"scan_summary": summary,
"repository_context": repo_context,
"analysis_prompt": base_prompt
}
# Configure OpenAI client
client = openai.OpenAI(api_key=os.environ['OPENAI_API_KEY'])
# Create analysis prompt
prompt = f"""
{base_prompt}
Scan Results Summary:
{json.dumps(summary, indent=2)}
Repository Context:
{json.dumps(repo_context, indent=2)}
Please provide a comprehensive analysis in JSON format with the following structure:
```json
{{
"executive_summary": "Brief overview of link health status",
"broken_links_analysis": [
{{
"url": "broken_url",
"issue_type": "category",
"root_cause": "explanation",
"recommended_action": "specific_fix",
"priority": "high|medium|low"
}}
],
"patterns_identified": ["pattern1", "pattern2"],
"preventive_measures": ["measure1", "measure2"],
"overall_recommendations": ["recommendation1", "recommendation2"],
"impact_assessment": "description of impact on users"
}}
```
"""
# Make API call
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "You are an expert DevOps engineer and technical writer specializing in maintaining high-quality educational content platforms."},
{"role": "user", "content": prompt}
],
max_tokens=2000
)
# Extract and save analysis
analysis_result = response.choices[0].message.content
# Try to parse as JSON, fallback to text if needed
try:
analysis_json = json.loads(analysis_result)
with open('./ai_analysis_result.json', 'w') as f:
json.dump(analysis_json, f, indent=2)
except json.JSONDecodeError:
# Save as text if JSON parsing fails
with open('./ai_analysis_result.txt', 'w') as f:
f.write(analysis_result)
analysis_json = {"raw_analysis": analysis_result}
return analysis_json
if __name__ == "__main__":
try:
result = analyze_link_health()
print("✅ AI analysis completed successfully")
print(json.dumps(result, indent=2))
except Exception as e:
print(f"❌ AI analysis failed: {str(e)}")
sys.exit(1)
PYTHON_EOF
python ai_analyzer.py
# Set output for next step
if [[ -f "./ai_analysis_result.json" ]]; then
echo "analysis_file=ai_analysis_result.json" >> $GITHUB_OUTPUT
else
echo "analysis_file=ai_analysis_result.txt" >> $GITHUB_OUTPUT
fi
- name: Create GitHub Issue with Analysis
uses: actions/github-script@v7
env:
ANALYSIS_FILE: ${{ steps.ai-analysis.outputs.analysis_file }}
with:
script: |
const fs = require('fs');
const path = require('path');
// Load scan summary
const summary = JSON.parse(fs.readFileSync('./analysis-input/summary.json', 'utf8'));
// Load AI analysis
let aiAnalysis;
const analysisPath = `./${process.env.ANALYSIS_FILE}`;
if (process.env.ANALYSIS_FILE.endsWith('.json')) {
aiAnalysis = JSON.parse(fs.readFileSync(analysisPath, 'utf8'));
} else {
aiAnalysis = { raw_analysis: fs.readFileSync(analysisPath, 'utf8') };
}
// Create issue body
let issueBody = `# Hyperlink Guardian Report
**Scan Date**: ${summary.scan_timestamp}
**Repository**: ${summary.repository || context.repo.owner + '/' + context.repo.repo}
**Site URL**: ${summary.site_url}
## Summary Statistics
- **Total Links Tested**: ${summary.total_links}
- **Working Links**: ${summary.working_links}
- **Broken Links**: ${summary.broken_links}
- **Success Rate**: ${summary.success_rate}%
`;
if (summary.broken_links > 0) {
issueBody += `## ❌ Broken Links Detected
`;
if (aiAnalysis.broken_links_analysis) {
issueBody += `### AI Analysis & Recommendations
**Executive Summary**: ${aiAnalysis.executive_summary || 'Analysis completed'}
#### Broken Link Details:
`;
aiAnalysis.broken_links_analysis.forEach((link, index) => {
issueBody += `
**${index + 1}. ${link.url}**
- **Issue Type**: ${link.issue_type || 'Unknown'}
- **Root Cause**: ${link.root_cause || 'Analysis pending'}
- **Recommended Action**: ${link.recommended_action || 'Manual review required'}
- **Priority**: ${link.priority || 'Medium'}
`;
});
if (aiAnalysis.patterns_identified && aiAnalysis.patterns_identified.length > 0) {
issueBody += `#### Patterns Identified:
${aiAnalysis.patterns_identified.map(pattern => `- ${pattern}`).join('\n')}
`;
}
if (aiAnalysis.preventive_measures && aiAnalysis.preventive_measures.length > 0) {
issueBody += `#### Preventive Measures:
${aiAnalysis.preventive_measures.map(measure => `- ${measure}`).join('\n')}
`;
}
if (aiAnalysis.overall_recommendations && aiAnalysis.overall_recommendations.length > 0) {
issueBody += `#### Overall Recommendations:
${aiAnalysis.overall_recommendations.map(rec => `- ${rec}`).join('\n')}
`;
}
if (aiAnalysis.impact_assessment) {
issueBody += `#### Impact Assessment:
${aiAnalysis.impact_assessment}
`;
}
} else if (aiAnalysis.raw_analysis) {
issueBody += `### AI Analysis:
${aiAnalysis.raw_analysis}
`;
}
if (summary.broken_link_details && summary.broken_link_details.length > 0) {
issueBody += `### Raw Link Test Results:
| URL | Status Code | Error Message |
|-----|-------------|---------------|
`;
summary.broken_link_details.forEach(link => {
issueBody += `| ${link.url} | ${link.status_code} | ${link.error_message || 'N/A'} |\n`;
});
}
} else {
issueBody += `## ✅ All Links Healthy
Great news! All ${summary.total_links} links are working correctly.
`;
}
issueBody += `
---
**Workflow Run**: [#${context.runNumber}](${context.payload.repository.html_url}/actions/runs/${context.runId})
**Commit**: ${context.sha.substring(0, 7)}
This issue was automatically created by the Hyperlink Guardian workflow.
`;
// Create the issue
const issue = await github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title: `Hyperlink Health Report - ${summary.broken_links > 0 ? summary.broken_links + ' broken links detected' : 'All links healthy'} (${new Date(summary.scan_timestamp).toLocaleDateString()})`,
body: issueBody,
labels: summary.broken_links > 0 ? ['bug', 'links', 'automated-report'] : ['maintenance', 'links', 'automated-report']
});
console.log(`Created issue: ${issue.data.html_url}`);
cleanup:
name: Cleanup Old Reports
needs: [link-health-scan, ai-analysis]
runs-on: ubuntu-latest
if: always()
permissions:
contents: read
issues: write
steps:
- name: Checkout Repository
uses: actions/checkout@v4
- name: Close Old Link Health Issues
uses: actions/github-script@v7
with:
script: |
// Find and close old automated link health reports
const { data: issues } = await github.rest.issues.listForRepo({
owner: context.repo.owner,
repo: context.repo.repo,
labels: 'automated-report,links',
state: 'open'
});
// Close issues older than 7 days
const oneWeekAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);
for (const issue of issues) {
const issueDate = new Date(issue.created_at);
if (issueDate < oneWeekAgo) {
await github.rest.issues.update({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: issue.number,
state: 'closed'
});
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: issue.number,
body: 'Automatically closed by Hyperlink Guardian - report is more than 7 days old.'
});
console.log(`Closed old issue: #${issue.number}`);
}
}
Your automated guardian workflow is now ready to protect your digital realm! Test it with a manual trigger before relying on the scheduled execution.
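If you have the GitHub CLI installed and authenticated, you can exercise the workflow_dispatch trigger from your terminal instead of waiting for the 3:00 AM schedule. This sketch assumes the workflow was saved as .github/workflows/hyperlink-guardian.yml; adjust the file name if yours differs.

```bash
# Manually trigger the guardian workflow (assumes gh CLI is installed and authenticated)
gh workflow run hyperlink-guardian.yml --field force_scan=true

# Give GitHub a moment to register the run, then follow it to completion
sleep 5
RUN_ID=$(gh run list --workflow=hyperlink-guardian.yml --limit 1 --json databaseId --jq '.[0].databaseId')
gh run watch "$RUN_ID"

# Download the uploaded scan artifacts for local inspection
gh run download "$RUN_ID"
```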
The true power of our hyperlink guardian lies not just in detecting broken links, but in understanding WHY they break and HOW to prevent future failures. This chapter will enhance your workflow with artificial intelligence that can analyze patterns, identify root causes, and provide strategic recommendations.
# scripts/ai_link_analyzer.py - Enhanced AI Analysis Engine
import json
import os
import sys
import logging
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple
import re
from urllib.parse import urlparse
try:
import openai
import requests
except ImportError:
print("Installing required packages...")
os.system("pip install openai requests")
import openai
import requests
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class HyperlinkIntelligenceEngine:
"""
Advanced AI-powered analysis engine for hyperlink health intelligence
"""
def __init__(self, api_key: str):
self.client = openai.OpenAI(api_key=api_key)
self.analysis_timestamp = datetime.utcnow().isoformat()
def load_scan_data(self, input_dir: str) -> Dict:
"""Load all scan data and context for analysis"""
try:
# Load primary scan results
with open(f"{input_dir}/summary.json", 'r') as f:
summary = json.load(f)
# Load repository context
with open(f"{input_dir}/repository_context.json", 'r') as f:
repo_context = json.load(f)
# Load detailed test results if available
detailed_results = []
results_file = f"{input_dir}/test_results.csv"
if os.path.exists(results_file):
with open(results_file, 'r') as f:
lines = f.readlines()[1:] # Skip header
for line in lines:
parts = line.strip().split('|')
if len(parts) >= 6:
detailed_results.append({
'timestamp': parts[0],
'url': parts[1],
'status_code': parts[2],
'response_time': parts[3],
'status': parts[4],
'error_message': parts[5] if len(parts) > 5 else ''
})
return {
'summary': summary,
'repository_context': repo_context,
'detailed_results': detailed_results,
'analysis_timestamp': self.analysis_timestamp
}
except Exception as e:
logger.error(f"Failed to load scan data: {str(e)}")
raise
def categorize_broken_links(self, broken_links: List[Dict]) -> Dict[str, List[Dict]]:
"""Categorize broken links by type and domain for pattern analysis"""
categories = {
'external_documentation': [],
'github_repositories': [],
'academic_resources': [],
'commercial_tools': [],
'internal_links': [],
'deprecated_services': [],
'temporary_failures': [],
'unknown': []
}
for link in broken_links:
url = link.get('url', '')
status_code = link.get('status_code', '')
error_message = link.get('error_message', '')
# Parse URL for analysis
parsed = urlparse(url)
domain = parsed.netloc.lower()
# Categorization logic
if 'github.com' in domain or 'gitlab.com' in domain:
categories['github_repositories'].append(link)
elif any(doc_site in domain for doc_site in ['docs.', 'documentation.', 'wiki.', 'manual.']):
categories['external_documentation'].append(link)
elif any(academic in domain for academic in ['.edu', 'arxiv.org', 'scholar.google']):
categories['academic_resources'].append(link)
elif status_code in ['500', '502', '503', '504']:
categories['temporary_failures'].append(link)
elif 'timeout' in error_message.lower() or 'connection' in error_message.lower():
categories['temporary_failures'].append(link)
elif parsed.netloc == '' or url.startswith('/'):
categories['internal_links'].append(link)
elif status_code == '404':
# Could be moved/deprecated content
categories['deprecated_services'].append(link)
else:
categories['unknown'].append(link)
# Remove empty categories
return {k: v for k, v in categories.items() if v}
def identify_patterns(self, scan_data: Dict) -> List[str]:
"""Identify patterns in link failures"""
patterns = []
broken_links = scan_data['summary'].get('broken_link_details', [])
if not broken_links:
return patterns
# Domain pattern analysis
domains = {}
for link in broken_links:
domain = urlparse(link.get('url', '')).netloc
domains[domain] = domains.get(domain, 0) + 1
# Check for domain-specific issues
for domain, count in domains.items():
if count > 1:
patterns.append(f"Multiple failures from domain: {domain} ({count} links)")
# Status code pattern analysis
status_codes = {}
for link in broken_links:
code = link.get('status_code', '')
status_codes[code] = status_codes.get(code, 0) + 1
for code, count in status_codes.items():
if count > 2:
patterns.append(f"Frequent {code} errors ({count} occurrences)")
# Time-based patterns (if we had historical data)
# This could be enhanced with trend analysis
return patterns
def generate_ai_analysis(self, scan_data: Dict) -> Dict:
"""Generate comprehensive AI analysis of link health"""
# Prepare data for AI analysis
broken_links = scan_data['summary'].get('broken_link_details', [])
categorized_links = self.categorize_broken_links(broken_links)
identified_patterns = self.identify_patterns(scan_data)
# Create detailed analysis prompt
analysis_prompt = f"""
As an expert DevOps engineer and technical content strategist, analyze this hyperlink health report for an educational IT platform.
CONTEXT:
- Repository: {scan_data['repository_context'].get('repository', 'Unknown')}
- Site URL: {scan_data['summary'].get('site_url', 'Unknown')}
- Total Links: {scan_data['summary'].get('total_links', 0)}
- Broken Links: {scan_data['summary'].get('broken_links', 0)}
- Success Rate: {scan_data['summary'].get('success_rate', 0)}%
BROKEN LINKS BY CATEGORY:
{json.dumps(categorized_links, indent=2)}
IDENTIFIED PATTERNS:
{json.dumps(identified_patterns, indent=2)}
DETAILED LINK DATA:
{json.dumps(broken_links[:20], indent=2)} # Limit to first 20 for token efficiency
ANALYSIS REQUIREMENTS:
1. Provide an executive summary of the link health situation
2. Analyze each broken link category and suggest specific remediation strategies
3. Identify root causes for the most common failure patterns
4. Recommend immediate actions prioritized by impact and effort
5. Suggest long-term preventive measures for maintaining link health
6. Assess the educational impact on learners and site users
7. Provide specific technical implementation recommendations
Please respond in valid JSON format with this structure:
```json
{{
"executive_summary": "Brief overview highlighting key issues and overall health",
"category_analysis": {{
"category_name": {{
"impact": "high|medium|low",
"root_cause": "Primary reason for failures in this category",
"recommended_actions": ["action1", "action2"],
"timeline": "immediate|short-term|long-term"
}}
}},
"priority_actions": [
{{
"action": "Specific action to take",
"priority": "high|medium|low",
"effort": "low|medium|high",
"impact": "high|medium|low",
"timeline": "immediate|short-term|long-term"
}}
],
"preventive_measures": [
{{
"measure": "Description of preventive action",
"implementation": "How to implement this measure",
"automation_potential": "Can this be automated? How?"
}}
],
"educational_impact": "Assessment of how broken links affect learning outcomes",
"technical_recommendations": [
{{
"recommendation": "Technical improvement suggestion",
"justification": "Why this recommendation is important",
"implementation_complexity": "low|medium|high"
}}
],
"monitoring_suggestions": [
"Enhanced monitoring recommendations"
]
}}
```
"""
try:
# Make API call to AI service
response = self.client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": "You are an expert DevOps engineer, technical writer, and educational content strategist with deep expertise in maintaining high-quality learning platforms. Provide actionable, specific, and technically sound recommendations."
},
{
"role": "user",
"content": analysis_prompt
}
],
max_tokens=3000,
temperature=0.3 # Lower temperature for more focused, technical responses
)
# Parse AI response
ai_response = response.choices[0].message.content
try:
# Try to parse as JSON
analysis_result = json.loads(ai_response)
except json.JSONDecodeError:
# Fallback: extract JSON from response if it's wrapped in markdown
json_match = re.search(r'```json\s*(.*?)\s*```', ai_response, re.DOTALL)
if json_match:
analysis_result = json.loads(json_match.group(1))
else:
# If JSON parsing fails completely, return structured fallback
analysis_result = {
"executive_summary": "AI analysis completed with parsing issues",
"raw_analysis": ai_response,
"analysis_status": "partial"
}
# Add metadata
analysis_result['ai_analysis_metadata'] = {
'model_used': 'gpt-4',
'analysis_timestamp': self.analysis_timestamp,
'tokens_used': response.usage.total_tokens if hasattr(response, 'usage') else 'unknown',
'broken_links_analyzed': len(broken_links)
}
return analysis_result
except Exception as e:
logger.error(f"AI analysis failed: {str(e)}")
# Return fallback analysis
return {
"executive_summary": f"Automated analysis detected {len(broken_links)} broken links requiring attention",
"error": str(e),
"fallback_analysis": True,
"broken_links_count": len(broken_links),
"categories_identified": list(categorized_links.keys()),
"patterns_identified": identified_patterns
}
def generate_actionable_report(self, analysis_result: Dict, scan_data: Dict) -> str:
"""Generate a comprehensive, actionable report for GitHub issues"""
summary = scan_data['summary']
repo_context = scan_data['repository_context']
report = f"""# Hyperlink Guardian Intelligence Report
## Executive Dashboard
**Scan Timestamp**: {summary.get('scan_timestamp', 'Unknown')}
**Repository**: {repo_context.get('repository', 'Unknown')}
**Site URL**: {summary.get('site_url', 'Unknown')}
**Workflow Run**: #{repo_context.get('workflow_run_id', 'Unknown')}
### Health Metrics
- **Total Links**: {summary.get('total_links', 0)}
- ✅ **Working Links**: {summary.get('working_links', 0)}
- ❌ **Broken Links**: {summary.get('broken_links', 0)}
- **Success Rate**: {summary.get('success_rate', 0):.1f}%
"""
if analysis_result.get('executive_summary'):
report += f"""## AI Analysis Summary
{analysis_result['executive_summary']}
"""
# Priority Actions Section
if analysis_result.get('priority_actions'):
report += f"""## Priority Actions
"""
for i, action in enumerate(analysis_result['priority_actions'], 1):
priority_emoji = {"high": "🔴", "medium": "🟡", "low": "🟢"}.get(action.get('priority', 'medium'), "🟡")
effort_emoji = {"low": "⚡", "medium": "⚙️", "high": "🏗️"}.get(action.get('effort', 'medium'), "⚙️")
report += f"""### {i}. {action.get('action', 'Action needed')}
- **Priority**: {priority_emoji} {action.get('priority', 'Medium').title()}
- **Effort**: {effort_emoji} {action.get('effort', 'Medium').title()}
- **Impact**: {action.get('impact', 'Medium').title()}
- **Timeline**: {action.get('timeline', 'Unknown')}
"""
# Category Analysis
if analysis_result.get('category_analysis'):
report += f"""## Broken Link Category Analysis
"""
for category, details in analysis_result['category_analysis'].items():
impact_emoji = {"high": "🔴", "medium": "🟡", "low": "🟢"}.get(details.get('impact', 'medium'), "🟡")
report += f"""### {category.replace('_', ' ').title()}
- **Impact**: {impact_emoji} {details.get('impact', 'Medium').title()}
- **Root Cause**: {details.get('root_cause', 'Analysis needed')}
- **Timeline**: {details.get('timeline', 'Unknown')}
**Recommended Actions**:
"""
for action in details.get('recommended_actions', []):
report += f"- {action}\n"
report += "\n"
# Technical Recommendations
if analysis_result.get('technical_recommendations'):
report += f"""## Technical Recommendations
"""
for i, rec in enumerate(analysis_result['technical_recommendations'], 1):
complexity_emoji = {"low": "🟢", "medium": "🟡", "high": "🔴"}.get(rec.get('implementation_complexity', 'medium'), "🟡")
report += f"""### {i}. {rec.get('recommendation', 'Recommendation')}
- **Complexity**: {complexity_emoji} {rec.get('implementation_complexity', 'Medium').title()}
- **Justification**: {rec.get('justification', 'Details needed')}
"""
# Preventive Measures
if analysis_result.get('preventive_measures'):
report += f"""## Preventive Measures
"""
for i, measure in enumerate(analysis_result['preventive_measures'], 1):
report += f"""### {i}. {measure.get('measure', 'Preventive measure')}
**Implementation**: {measure.get('implementation', 'Details needed')}
**Automation Potential**: {measure.get('automation_potential', 'Assessment needed')}
"""
# Educational Impact
if analysis_result.get('educational_impact'):
report += f"""## Educational Impact Assessment
{analysis_result['educational_impact']}
"""
# Monitoring Suggestions
if analysis_result.get('monitoring_suggestions'):
report += f"""## Enhanced Monitoring Recommendations
"""
for suggestion in analysis_result['monitoring_suggestions']:
report += f"- {suggestion}\n"
# Raw Data Section
if summary.get('broken_link_details'):
report += f"""## Detailed Link Test Results
| URL | Status | Error Details |
|-----|--------|---------------|
"""
for link in summary['broken_link_details'][:20]: # Limit to prevent overly long issues
url = link.get('url', 'Unknown')[:80] + ('...' if len(link.get('url', '')) > 80 else '')
status = link.get('status_code', 'Unknown')
error = link.get('error_message', 'N/A')[:50] + ('...' if len(link.get('error_message', '')) > 50 else '')
report += f"| {url} | {status} | {error} |\n"
if len(summary['broken_link_details']) > 20:
report += f"\n*Showing first 20 of {len(summary['broken_link_details'])} broken links*\n"
# Metadata footer
report += f"""
---
## Analysis Metadata
**AI Model**: {analysis_result.get('ai_analysis_metadata', {}).get('model_used', 'Unknown')}
**Analysis Timestamp**: {analysis_result.get('ai_analysis_metadata', {}).get('analysis_timestamp', 'Unknown')}
**Links Analyzed**: {analysis_result.get('ai_analysis_metadata', {}).get('broken_links_analyzed', 'Unknown')}
*This report was automatically generated by the Hyperlink Guardian with AI-powered analysis.*
"""
return report
def main():
"""Main execution function for AI analysis"""
# Configuration
input_dir = "./analysis-input"
output_file = "./ai_analysis_result.json"
report_file = "./analysis_report.md"
# Check for required environment variables
api_key = os.environ.get('OPENAI_API_KEY')
if not api_key:
logger.error("OPENAI_API_KEY environment variable is required")
sys.exit(1)
try:
# Initialize AI engine
logger.info("Initializing Hyperlink Intelligence Engine...")
ai_engine = HyperlinkIntelligenceEngine(api_key)
# Load scan data
logger.info("Loading scan data...")
scan_data = ai_engine.load_scan_data(input_dir)
# Generate AI analysis
logger.info("Generating AI analysis...")
analysis_result = ai_engine.generate_ai_analysis(scan_data)
# Save analysis result
with open(output_file, 'w') as f:
json.dump(analysis_result, f, indent=2)
logger.info(f"Analysis saved to {output_file}")
# Generate actionable report
logger.info("Generating actionable report...")
report = ai_engine.generate_actionable_report(analysis_result, scan_data)
with open(report_file, 'w') as f:
f.write(report)
logger.info(f"Report saved to {report_file}")
# Output summary for GitHub Actions
broken_count = scan_data['summary'].get('broken_links', 0)
logger.info(f"✅ Analysis complete! Found {broken_count} broken links.")
return analysis_result
except Exception as e:
logger.error(f"❌ Analysis failed: {str(e)}")
sys.exit(1)
if __name__ == "__main__":
main()
Your AI analysis engine is now capable of providing intelligent insights about link failures! Test it with sample data before integrating into the full workflow.
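One low-risk way to do that is to feed the engine a tiny synthetic scan result instead of a real crawl. Every value below - the file contents, URLs, and repository names - is fabricated purely for the dry run; only the OPENAI_API_KEY needs to be real.

```bash
# Build a minimal synthetic input set for the analyzer (all values are placeholders)
mkdir -p analysis-input

cat > analysis-input/summary.json << 'EOF'
{
  "scan_timestamp": "2025-01-01 03:00:00 UTC",
  "site_url": "https://example.github.io/sample-site",
  "total_links": 3,
  "working_links": 2,
  "broken_links": 1,
  "success_rate": 66.67,
  "broken_link_details": [
    {"url": "https://example.com/moved-page", "status_code": "404", "error_message": "", "timestamp": "2025-01-01 03:00:10"}
  ]
}
EOF

cat > analysis-input/repository_context.json << 'EOF'
{
  "repository": "example/sample-site",
  "branch": "main",
  "commit_sha": "0000000",
  "workflow_run_id": "0",
  "trigger": "workflow_dispatch",
  "site_url": "https://example.github.io/sample-site"
}
EOF

# Run the analyzer against the synthetic data and read the generated report
export OPENAI_API_KEY="sk-your-key-here"
python scripts/ai_link_analyzer.py
cat analysis_report.md
```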
Objective: Set up a local testing environment to validate your hyperlink guardian before deployment (see the starter sketch after this challenge)
Requirements:
Success Criteria:
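One minimal way to approach this challenge: build a throwaway fixture with one link you know works and one that cannot resolve, then confirm the guardian reports exactly one failure. The directory names, file names, and URLs below are illustrative only.

```bash
# Throwaway fixture for validating the guardian locally (paths and URLs are illustrative)
mkdir -p /tmp/guardian-fixture && cd /tmp/guardian-fixture

cat > sample-post.md << 'EOF'
# Fixture Page
A link that should pass: [GitHub](https://github.com)
A link that should fail: [Nowhere](https://no-such-host.invalid)
EOF

# Run the guardian from inside the fixture so only this file is scanned
SITE_URL="https://example.github.io/fixture" \
OUTPUT_DIR="./results" \
bash /path/to/hyperlink-guardian.sh

# Expect exactly one FAIL row in the results
grep -c '|FAIL|' results/test_results.csv
```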
Objective: Enhance the guardian with specialized testing capabilities (see the retry sketch after this challenge)
Requirements:
Success Criteria:
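One direction this challenge could take, sketched under the assumption that transient timeouts and 5xx responses are your noisiest failures: wrap the curl call in a retry helper so a link is only marked broken after several attempts. The retry_curl function is hypothetical and not part of the script above.

```bash
# Possible enhancement: retry transient failures before declaring a link broken
retry_curl() {
  local url="$1"
  local attempts="${2:-3}"
  local delay="${3:-5}"
  local status_code

  for ((i = 1; i <= attempts; i++)); do
    status_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time "${TIMEOUT:-30}" "$url" || echo "ERROR")
    # Treat 4xx as final; only retry transport errors and 5xx responses
    if [[ "$status_code" != "ERROR" && "$status_code" -lt 500 ]]; then
      echo "$status_code"
      return 0
    fi
    sleep "$delay"
  done

  echo "$status_code"
  return 1
}

# Example usage: retry_curl "https://example.com" 3 5
```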
Objective: Optimize the GitHub Actions workflow for efficiency and reliability (see the caching sketch after this challenge)
Requirements:
Success Criteria:
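One efficiency idea, sketched as shell helpers rather than a finished workflow change: keep a cache of URLs that passed recently (restored between runs with actions/cache, for instance) and skip re-testing them for a few days. The cache file format and variable names here are assumptions.

```bash
# Possible optimization: skip URLs verified within the last CACHE_DAYS days
CACHE_FILE="${CACHE_FILE:-.link-cache/verified_urls.txt}"
CACHE_DAYS="${CACHE_DAYS:-7}"
mkdir -p "$(dirname "$CACHE_FILE")" && touch "$CACHE_FILE"

recently_verified() {
  local url="$1"
  local now cutoff cached_epoch
  now=$(date +%s)
  cutoff=$((now - CACHE_DAYS * 86400))
  cached_epoch=$(grep -F "$url|" "$CACHE_FILE" | tail -n 1 | cut -d'|' -f2)
  [[ -n "$cached_epoch" && "$cached_epoch" -ge "$cutoff" ]]
}

mark_verified() {
  echo "$1|$(date +%s)" >> "$CACHE_FILE"
}

# Inside the scan loop:
#   recently_verified "$url" && continue
#   ... test the URL ...
#   [[ "$status" == "PASS" ]] && mark_verified "$url"
```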
Objective: Create an enterprise-ready hyperlink monitoring system (see the notification sketch after this challenge)
Requirements:
Success Criteria:
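As one example of the kind of integration an enterprise-grade setup might add, the snippet below pushes a short digest to a chat webhook after each scan. The SLACK_WEBHOOK_URL secret and the message shape are assumptions layered on top of the guardian's summary.json, not features of the workflow above.

```bash
# Possible enterprise add-on: post a scan digest to a chat webhook (SLACK_WEBHOOK_URL is hypothetical)
SUMMARY_FILE="${OUTPUT_DIR:-./link-check-results}/summary.json"

BROKEN=$(jq -r '.broken_links' "$SUMMARY_FILE")
TOTAL=$(jq -r '.total_links' "$SUMMARY_FILE")
RATE=$(jq -r '.success_rate' "$SUMMARY_FILE")

curl -s -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "$(jq -n --arg text "Hyperlink Guardian: $BROKEN of $TOTAL links broken (success rate ${RATE}%)" '{text: $text}')"
```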
Comprehensive checklist that proves the learner has achieved mastery
Cross-Technology Skills: Advanced automation concepts apply to any CI/CD platform.
Career Path Integration: DevOps engineering, site reliability engineering, and quality assurance roles.
Project Application: Any web application or documentation site requiring link integrity monitoring.
Congratulations, noble guardian! You have successfully forged an intelligent sentinel that will tirelessly protect your digital realm from the corruption of broken hyperlinks. Your hyperlink guardian combines the precision of automated testing with the wisdom of artificial intelligence, creating a system that not only detects problems but understands their causes and provides actionable solutions.
With this quest complete, you now possess the knowledge to create sophisticated DevOps automation that leverages AI for intelligent analysis and reporting. Your guardian will serve as a template for building other automated quality assurance systems that enhance the reliability and user experience of digital platforms.
May your links remain strong, your automation resilient, and your intelligence artificial yet wise! ✨