Testing and Validating Skills
This guide helps you validate skills before adding them to the repository or using them in production.
Quick Validation Checklist
Run through this checklist before submitting a skill:
Metadata
[ ] SKILL.md exists
[ ] YAML frontmatter is valid
[ ] Name ≤ 64 characters
[ ] Description ≤ 1024 characters
[ ] Description includes trigger scenarios
Content Quality
[ ] "When to Use This Skill" section present
[ ] At least one concrete example
[ ] Examples are runnable/testable
[ ] File references are accurate
[ ] No sensitive data hardcoded
Triggering Tests
[ ] Triggers on target scenarios
[ ] Doesn't trigger on unrelated scenarios
[ ] No conflicts with similar skills
Security
[ ] No credentials or API keys
[ ] No personal information
[ ] Safe file system access only
[ ] External dependencies verified
Detailed Testing Process
1. Metadata Validation
Test YAML Parsing
Try parsing the frontmatter:
# Extract the YAML frontmatter (the lines between the first two --- delimiters)
awk '/^---$/ { n++; next } n == 1' SKILL.md
Verify:
- YAML is valid (no syntax errors)
- Both name and description are present
- Values are within character limits
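The presence check above can be scripted. The following is a minimal sketch, assuming the frontmatter is delimited by two --- lines and that each key starts at the beginning of a line; extract_frontmatter and check_required_fields are hypothetical helper names, and the sketch checks only that the keys exist, not that the YAML is fully valid.

```shell
#!/bin/bash
# Sketch: extract YAML frontmatter and confirm the required keys exist.
# extract_frontmatter and check_required_fields are hypothetical helpers.

# Print the lines between the first two '---' delimiters of a file.
extract_frontmatter() {
  awk '/^---[[:space:]]*$/ { n++; next } n == 1' "$1"
}

# Return 0 if both name: and description: appear in the frontmatter,
# otherwise print each missing key and return 1.
check_required_fields() {
  local fm missing=0 key
  fm=$(extract_frontmatter "$1")
  for key in name description; do
    if ! printf '%s\n' "$fm" | grep -q "^${key}:"; then
      echo "Missing field: ${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_required_fields SKILL.md prints nothing and returns 0 when both keys are present. A proper YAML parser is still needed to catch syntax errors inside the values.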
Character Limits
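# Count characters in name (must be ≤ 64); tr strips the trailing newline that wc would otherwise count
grep -m 1 "^name:" SKILL.md | sed 's/^name: *//' | tr -d '\n' | wc -m
# Count characters in description (must be ≤ 1024); assumes the value fits on one line
grep -m 1 "^description:" SKILL.md | sed 's/^description: *//' | tr -d '\n' | wc -m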
2. Content Quality Testing
Check Required Sections
# Verify "When to Use This Skill" section exists
grep -i "when to use" SKILL.md
# Verify examples exist
grep -i "example" SKILL.md
Test File References
If skill references other files, verify they exist:
# Find markdown links to .md files (one match per link, even with several links on a line)
grep -o '\[[^]]*]([^)]*\.md)' SKILL.md
# Check if referenced files exist
# (manually verify each one)
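The manual existence check can also be sketched as a small function. This assumes links are written as [text](path.md), optionally followed by a #anchor, and that paths are relative to the directory containing the file; check_md_links is a hypothetical helper name, and links with a URL scheme are skipped.

```shell
#!/bin/bash
# Sketch: report markdown links to .md files that do not exist on disk.
# check_md_links is a hypothetical helper; link targets are resolved
# relative to the directory that contains the file being checked.
check_md_links() {
  local file=$1 dir missing
  dir=$(dirname "$file")
  missing=$(
    grep -oE '\]\([^)#]+\.md' "$file" \
      | sed 's/^](//' \
      | grep -v '://' \
      | while IFS= read -r target; do
          [ -f "$dir/$target" ] || echo "Missing: $target"
        done
  )
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
  return 0
}
```

Running check_md_links SKILL.md prints one line per missing target and returns a non-zero status if any are found.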
Validate Examples
For each example in the skill:
- Try running the code/commands
- Verify output matches expectations
- Check for edge cases
- Ensure examples are complete (no placeholders)
3. Trigger Testing
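This is the most important validation step: a skill that fails to trigger, or triggers on the wrong requests, delivers no value however good its content is.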
Create Test Scenarios
Positive Tests (SHOULD trigger)
Create a list of scenarios where the skill should activate:
Test Scenario 1: [Describe task that should trigger]
Expected: Skill activates
Actual: [Test result]
Test Scenario 2: [Another trigger case]
Expected: Skill activates
Actual: [Test result]
Negative Tests (SHOULD NOT trigger)
Create scenarios where the skill should NOT activate:
Test Scenario 3: [Similar but different task]
Expected: Skill does NOT activate
Actual: [Test result]
Test Scenario 4: [Unrelated task]
Expected: Skill does NOT activate
Actual: [Test result]
Example Testing Session
For a "Python Testing with pytest" skill:
Should Trigger:
- "Help me write tests for my Python function"
- "How do I use pytest fixtures?"
- "Create unit tests for this class"
Should NOT Trigger:
- "Help me test my JavaScript code" (different language)
- "Debug my pytest installation" (installation, not testing)
- "Explain what unit testing is" (concept, not implementation)
Run Tests with Claude
- Load the skill
- Ask Claude each test question
- Observe whether the skill triggers (check the response for skill context)
- Document results
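If you record the results of each manual test in a plain text log, a short script can tally them. The sketch below assumes a hypothetical log format of one test per line, written as expected|actual|prompt with yes or no in the first two fields; summarize_trigger_tests is an illustrative name, not part of any tool.

```shell
#!/bin/bash
# Sketch: tally manually recorded trigger-test results.
# Assumed (hypothetical) log format, one test per line:
#   expected|actual|prompt     where expected/actual are "yes" or "no"
summarize_trigger_tests() {
  awk -F'|' '
    NF >= 3 {
      total++
      if ($1 == $2) { pass++ }
      else { printf "FAIL (expected %s, got %s): %s\n", $1, $2, $3 }
    }
    END {
      printf "%d/%d trigger tests passed\n", pass, total
      exit (pass == total ? 0 : 1)
    }
  ' "$1"
}
```

The function returns a non-zero status when any recorded result differs from its expectation, so it can gate a submission checklist.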
4. Token Efficiency Testing
Measure Content Size
# Count words, then estimate tokens as words × 1.3
wc -w SKILL.md
# Or count bytes (≈ characters for ASCII text) and estimate tokens as characters ÷ 4
# (both are rough estimates; use a real tokenizer for exact counts)
wc -c SKILL.md
Evaluate Split Points
Ask yourself:
- Is content loaded only when needed?
- Could mutually exclusive sections be split?
- Are examples concise but complete?
- Is reference material in separate files?
Target sizes:
- SKILL.md: Under 3000 tokens (core workflows)
- Additional files: Load only when referenced
- Total metadata: ~100 tokens
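To compare a file against the 3000-token target, the words × 1.3 heuristic can be wrapped in a function. This is a rough sketch; estimate_tokens and check_token_budget are hypothetical helpers, and a real tokenizer will give different numbers.

```shell
#!/bin/bash
# Sketch: estimate tokens with the words x 1.3 heuristic and compare the
# result against a budget. Both function names are hypothetical.
estimate_tokens() {
  local words
  words=$(wc -w < "$1")
  # Integer arithmetic: words * 13 / 10 approximates words x 1.3.
  echo $(( words * 13 / 10 ))
}

# Usage: check_token_budget FILE [BUDGET]   (BUDGET defaults to 3000)
check_token_budget() {
  local file=$1 budget=${2:-3000} tokens
  tokens=$(estimate_tokens "$file")
  if [ "$tokens" -gt "$budget" ]; then
    echo "$file: ~$tokens tokens (over budget of $budget)"
    return 1
  fi
  echo "$file: ~$tokens tokens (within budget of $budget)"
}
```

For example, check_token_budget SKILL.md uses the default budget, while check_token_budget reference.md 5000 applies a custom one.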
5. Security Validation
Automated Checks
# Check for potential secrets
grep -iE "(password|api[_-]?key|secret|token|credential)" SKILL.md
# Check for hardcoded paths
grep -E "(/Users/|/home/|C:\\\\)" SKILL.md
# Check for sensitive file extensions
grep -E "\.(key|pem|cert|p12|pfx)( |$)" SKILL.md
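The three checks can be combined into one function that prints matching lines with their line numbers. This is a sketch using the same patterns; scan_skill_file is a hypothetical helper name, and every match is a lead for manual review rather than proof of a leak (the word "token" alone will match).

```shell
#!/bin/bash
# Sketch: run the three pattern checks and print matches with line numbers.
# scan_skill_file is a hypothetical helper; it returns 1 if anything matched.
scan_skill_file() {
  local file=$1 found=0
  # Potential secrets
  if grep -nEi '(password|api[_-]?key|secret|token|credential)' "$file"; then
    found=1
  fi
  # Hardcoded user-specific paths
  if grep -nE '(/Users/|/home/|C:\\)' "$file"; then
    found=1
  fi
  # References to key or certificate files
  if grep -nE '\.(key|pem|cert|p12|pfx)( |$)' "$file"; then
    found=1
  fi
  return "$found"
}
```

A clean file produces no output and a zero exit status.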
Manual Review
Review each file for:
- No credentials in examples
- No personal information
- File paths are generic/relative
- Network access is documented
- External dependencies are from trusted sources
- Scripts don't make unsafe system changes
6. Cross-Skill Conflict Testing
If you have multiple skills installed:
- Similar domain overlap: Test that specific skills trigger (not generic ones)
- Keyword conflicts: Check if multiple skills trigger on same query
- Description clarity: Ensure each skill's domain is distinct
Example conflicts to avoid:
- "Python Helper" (too generic) vs "Python Testing with pytest" (specific)
- Both trigger on "Help with Python" → Fix by making descriptions more specific
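One way to spot keyword conflicts before testing with Claude is to list the words two descriptions share. The sketch below assumes each description is a single-line value; description_words and shared_keywords are hypothetical helpers, and the four-character minimum is an arbitrary filter for short filler words.

```shell
#!/bin/bash
# Sketch: list words that two skill descriptions have in common.
# Both function names are hypothetical; descriptions must be single-line.

# Print the unique lowercase words (4+ characters) of a description.
description_words() {
  grep -m 1 '^description:' "$1" \
    | sed 's/^description: *//' \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs '[:alnum:]' '\n' \
    | awk 'length($0) >= 4' \
    | sort -u
}

# Print the words present in both descriptions, one per line.
shared_keywords() {
  local a b
  a=$(mktemp) && b=$(mktemp) || return 2
  description_words "$1" > "$a"
  description_words "$2" > "$b"
  comm -12 "$a" "$b"
  rm -f "$a" "$b"
}
```

A long list of shared words suggests the descriptions need sharper, more distinct wording; a short list does not guarantee the skills will never both trigger.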
Testing Workflows
Quick Test (5 minutes)
For minor updates or simple skills:
- ✓ Validate metadata (YAML, character limits)
- ✓ Check one example works
- ✓ Test one positive trigger
- ✓ Test one negative trigger
- ✓ Scan for secrets
Standard Test (15 minutes)
For new skills or significant changes:
- ✓ Complete metadata validation
- ✓ Test all examples
- ✓ Run 3-5 trigger tests (positive + negative)
- ✓ Check token efficiency
- ✓ Full security review
- ✓ Verify file references
Comprehensive Test (30+ minutes)
For complex skills or pre-release:
- ✓ All standard tests
- ✓ Test with different Claude models
- ✓ Test conflict scenarios with other skills
- ✓ Have someone else try the skill
- ✓ Test edge cases in examples
- ✓ Review progressive disclosure strategy
- ✓ Load test (simulate typical usage)
Common Issues and Fixes
Skill Doesn't Trigger
Symptoms: Claude doesn't load skill context when expected
Diagnose:
- Description too vague?
- Description missing trigger keywords?
- Name too generic?
Fix:
# Before
description: Python development helpers
# After
description: Create Python projects using Hatch and Hatchling for dependency management. Use when initializing new Python packages or configuring build systems.
Skill Triggers Too Often
Symptoms: Skill loads for unrelated queries
Diagnose:
- Description too broad?
- Keywords too common?
Fix:
# Add specificity and exclusions
description: Debug Swift applications using LLDB for crashes, memory issues, and runtime errors. Use when investigating Swift bugs or analyzing app behavior. NOT for general Swift coding or learning.
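Both triggering problems come down to the wording of the description, so a quick lint can catch the most obvious cases. The following is only a sketch: lint_description is a hypothetical helper, and the eight-word minimum and the "Use when" phrase are illustrative heuristics drawn from the examples in this guide, not official rules.

```shell
#!/bin/bash
# Sketch: flag descriptions that look too vague to trigger reliably.
# lint_description is a hypothetical helper; the thresholds are heuristics.
lint_description() {
  local desc words status=0
  desc=$(grep -m 1 '^description:' "$1" | sed 's/^description: *//')
  # Arithmetic expansion strips the padding some wc implementations add.
  words=$(( $(printf '%s\n' "$desc" | wc -w) ))
  if [ "$words" -lt 8 ]; then
    echo "Description has only $words words; consider naming tools and tasks"
    status=1
  fi
  if ! printf '%s\n' "$desc" | grep -qi 'use when'; then
    echo "Description has no 'Use when' trigger sentence"
    status=1
  fi
  return "$status"
}
```

A description that passes this lint can still trigger too often or too rarely, so the trigger tests remain the real check.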
Examples Don't Work
Symptoms: Users can't reproduce examples
Diagnose:
- Missing prerequisites?
- Placeholders not explained?
- Environment-specific code?
Fix:
- Add prerequisites section
- Make examples self-contained
- Use generic paths and values
High Token Usage
Symptoms: Skill loads too much content
Diagnose:
- Too much in SKILL.md?
- No progressive disclosure?
- Verbose examples?
Fix:
- Split reference material to separate files
- Link to external resources
- Condense examples
- Move advanced content to on-demand files
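To decide what to split out, it helps to see which sections carry the most words. This sketch assumes markdown headings that start with one or more # characters followed by a space; section_word_counts is a hypothetical helper, and # comments inside code blocks will be miscounted as headings.

```shell
#!/bin/bash
# Sketch: print the word count of each markdown section, largest first,
# to find candidates for moving into on-demand files.
# section_word_counts is a hypothetical helper name.
section_word_counts() {
  awk '
    /^#+ / {
      # A new heading closes the previous section.
      if (title != "") printf "%d\t%s\n", words, title
      title = $0; words = 0; next
    }
    title != "" { words += NF }
    END { if (title != "") printf "%d\t%s\n", words, title }
  ' "$1" | sort -rn
}
```

The sections at the top of the output are the first candidates for separate reference files.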
Automated Testing (Advanced)
For repositories with many skills, consider automation:
Validate All Skills
#!/bin/bash
# validate-skills.sh
# Exits non-zero if any skill fails a check, so CI can fail the build.
status=0
for skill_dir in */; do
  skill_file="${skill_dir}SKILL.md"
  [ -f "$skill_file" ] || continue
  echo "Validating $skill_dir..."
  ok=1
  # Check frontmatter exists
  if ! grep -q "^---$" "$skill_file"; then
    echo "❌ Missing YAML frontmatter"
    ok=0
  fi
  # Check name length
  name=$(grep -m 1 "^name:" "$skill_file" | sed 's/^name: *//')
  if [ ${#name} -gt 64 ]; then
    echo "❌ Name too long: ${#name} chars"
    ok=0
  fi
  # Check for secrets (warning only; matches need manual review)
  if grep -qiE "(password|api[_-]?key|secret)" "$skill_file"; then
    echo "⚠️ Potential secrets found"
  fi
  # Report success only when every check passed
  if [ "$ok" -eq 1 ]; then
    echo "✓ $skill_dir validated"
  else
    status=1
  fi
done
exit $status
CI/CD Integration
Add to GitHub Actions or similar:
name: Validate Skills
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run validation
        run: |
          chmod +x validate-skills.sh
          ./validate-skills.sh
Documentation Testing
Ensure documentation is accurate:
- Links work: All markdown links resolve
- Paths are correct: File references are accurate
- Examples are current: Code samples match latest versions
- Formatting is consistent: Markdown renders correctly
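# Check for broken internal links (targets resolved relative to each file)
grep -roE --include='*.md' '\]\([^)#]+\.md' . | grep -v '://' | while IFS= read -r match; do
  file=${match%%:*}        # path of the file that contains the link
  target=${match##*']('}   # link target with the leading "](" removed
  [ -f "$(dirname "$file")/$target" ] || echo "Broken link in $file: $target"
done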
User Acceptance Testing
The ultimate test is real usage:
- Give skill to others: Have colleagues test it
- Monitor usage: See when it triggers in practice
- Gather feedback: Ask users about clarity and usefulness
- Iterate: Refine based on real-world usage
Testing Checklist Template
Copy this for each skill you test:
# Testing Report: [Skill Name]
Date: [YYYY-MM-DD]
Tester: [Name]
## Metadata
- [ ] YAML valid
- [ ] Name ≤ 64 chars
- [ ] Description ≤ 1024 chars
- [ ] Trigger scenarios in description
## Content
- [ ] "When to Use" section present
- [ ] Examples runnable
- [ ] File references accurate
- [ ] No secrets
## Triggering
Positive tests:
1. [Scenario] - Result: [ ] Pass [ ] Fail
2. [Scenario] - Result: [ ] Pass [ ] Fail
Negative tests:
1. [Scenario] - Result: [ ] Pass [ ] Fail
2. [Scenario] - Result: [ ] Pass [ ] Fail
## Security
- [ ] No credentials
- [ ] No personal data
- [ ] Safe file access
- [ ] Dependencies verified
## Overall
- [ ] Ready for production
- [ ] Needs revision
- [ ] Rejected
Notes:
[Any additional observations]
Resources
- claude-skills/SKILL.md - Best practices guide
- claude-skills/checklist.md - Quality checklist
- CONTRIBUTING.md - Contribution guidelines
Remember: testing is not only about finding bugs. It is about ensuring your skill provides real value and triggers at the right time.