# Testing and Validating Skills

This guide helps you validate skills before adding them to the repository or using them in production.

## Quick Validation Checklist

Run through this checklist before submitting a skill:

```
Metadata
[ ] SKILL.md exists
[ ] YAML frontmatter is valid
[ ] Name ≤ 64 characters
[ ] Description ≤ 1024 characters
[ ] Description includes trigger scenarios

Content Quality
[ ] "When to Use This Skill" section present
[ ] At least one concrete example
[ ] Examples are runnable/testable
[ ] File references are accurate
[ ] No sensitive data hardcoded

Triggering Tests
[ ] Triggers on target scenarios
[ ] Doesn't trigger on unrelated scenarios
[ ] No conflicts with similar skills

Security
[ ] No credentials or API keys
[ ] No personal information
[ ] Safe file system access only
[ ] External dependencies verified
```

## Detailed Testing Process

### 1. Metadata Validation

#### Test YAML Parsing

Start by extracting the frontmatter so you can inspect it:

```bash
# Print the frontmatter (the lines between the first two --- delimiters)
awk '/^---$/ { if (++n == 2) exit; next } n == 1' SKILL.md
```

Verify:
- YAML is valid (no syntax errors)
- Both `name` and `description` are present
- Values are within character limits
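
The presence check in the list above can be scripted. The following is a minimal sketch, not a YAML validator: `check_frontmatter` is a hypothetical helper name, and it assumes the frontmatter sits between the first two `---` lines with each key at the start of its own line.

```bash
# Sketch only: check_frontmatter is a hypothetical helper, not a repository tool.
# Prints one line per required key that is missing; returns non-zero if any is missing.
check_frontmatter() {
  file=$1
  # Keep only the lines between the first two '---' delimiters.
  frontmatter=$(awk '/^---$/ { if (++n == 2) exit; next } n == 1' "$file")
  rc=0
  for key in name description; do
    if ! printf '%s\n' "$frontmatter" | grep -q "^$key:"; then
      echo "missing key: $key"
      rc=1
    fi
  done
  return $rc
}
```

Call it as `check_frontmatter SKILL.md`. It does not detect YAML syntax errors, so a real YAML parser is still needed for that part of the check.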

#### Character Limits

```bash
# Count characters in name (must be ≤ 64)
# tr strips the trailing newline so it is not counted; wc -m counts characters, not bytes
grep -m1 "^name:" SKILL.md | sed 's/^name: *//' | tr -d '\n' | wc -m

# Count characters in description (must be ≤ 1024)
grep -m1 "^description:" SKILL.md | sed 's/^description: *//' | tr -d '\n' | wc -m
```
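
To turn the counts into a pass/fail check, the limits can be wrapped in small functions. This is a sketch under stated assumptions: `field_length` and `check_limit` are hypothetical names, and the value is assumed to be an unquoted, single-line scalar (quoted or multi-line YAML values would need a real parser).

```bash
# Hypothetical helper: print the character count of a single-line frontmatter value.
field_length() {
  file=$1
  key=$2
  sed -n "s/^$key:[[:space:]]*//p" "$file" | head -n 1 | tr -d '\n' | wc -m | tr -d '[:space:]'
}

# Hypothetical helper: report and fail when the value exceeds the given limit.
check_limit() {
  length=$(field_length "$1" "$2")
  if [ "$length" -gt "$3" ]; then
    echo "$2 too long: $length > $3"
    return 1
  fi
}
```

For the limits in this guide, the calls would be `check_limit SKILL.md name 64` and `check_limit SKILL.md description 1024`.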

### 2. Content Quality Testing

#### Check Required Sections

```bash
# Verify "When to Use This Skill" section exists
grep -i "when to use" SKILL.md

# Verify examples exist
grep -i "example" SKILL.md
```

#### Test File References

If the skill references other files, verify that they exist:

```bash
# Find markdown links
grep -o '\[.*\]([^)]*\.md)' SKILL.md

# Check if referenced files exist
# (manually verify each one)
```
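
The manual step can be partly automated. The sketch below assumes link targets are relative to the directory containing `SKILL.md` and contain no spaces; `check_links` is a hypothetical helper name.

```bash
# Sketch only: report relative .md link targets that do not exist on disk.
check_links() {
  file=$1
  dir=$(dirname "$file")
  rc=0
  # Pull out each "](target.md)" and strip the surrounding "](" and ")".
  for target in $(grep -o ']([^)]*\.md)' "$file" | sed 's/^](//; s/)$//'); do
    case $target in
      http://*|https://*) continue ;;  # external links are out of scope here
    esac
    if [ ! -f "$dir/$target" ]; then
      echo "missing: $target"
      rc=1
    fi
  done
  return $rc
}
```

Run it as `check_links SKILL.md`. Links with anchors (such as `guide.md#section`) are not matched by this pattern and still need a manual look.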

#### Validate Examples

For each example in the skill:
1. Try running the code/commands
2. Verify output matches expectations
3. Check for edge cases
4. Ensure examples are complete (no placeholders)

### 3. Trigger Testing

This is the most important validation step.

#### Create Test Scenarios

**Positive Tests (SHOULD trigger)**

Create a list of scenarios where the skill should activate:

```markdown
Test Scenario 1: [Describe task that should trigger]
Expected: Skill activates
Actual: [Test result]

Test Scenario 2: [Another trigger case]
Expected: Skill activates
Actual: [Test result]
```

**Negative Tests (SHOULD NOT trigger)**

Create scenarios where the skill should NOT activate:

```markdown
Test Scenario 3: [Similar but different task]
Expected: Skill does NOT activate
Actual: [Test result]

Test Scenario 4: [Unrelated task]
Expected: Skill does NOT activate
Actual: [Test result]
```

#### Example Testing Session

For a "Python Testing with pytest" skill:

**Should Trigger:**
- "Help me write tests for my Python function"
- "How do I use pytest fixtures?"
- "Create unit tests for this class"

**Should NOT Trigger:**
- "Help me test my JavaScript code" (different language)
- "Debug my pytest installation" (installation, not testing)
- "Explain what unit testing is" (concept, not implementation)

#### Run Tests with Claude

1. Load the skill
2. Ask Claude each test question
3. Observe whether the skill triggers (check the response for skill context)
4. Document results
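
One way to document results is a small tab-separated log that a script can tally. Everything in this sketch is an assumption made for illustration: the file layout (`scenario<TAB>expected<TAB>actual`, with `yes`/`no` for whether the skill triggered) and the `summarize_results` helper are hypothetical, and the observations themselves still come from manual testing.

```bash
# Sketch only: tally a hand-written results log.
# Assumed format, one test per line: scenario<TAB>expected<TAB>actual
summarize_results() {
  awk -F '\t' '
    $2 == $3 { pass++; next }
    { fail++; printf "FAIL: %s (expected %s, got %s)\n", $1, $2, $3 }
    END { printf "%d passed, %d failed\n", pass, fail; exit (fail > 0) }
  ' "$1"
}
```

Running `summarize_results results.tsv` lists each mismatch and returns non-zero when any scenario failed, which makes the log usable as a regression check after editing a description.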

### 4. Token Efficiency Testing

#### Measure Content Size

```bash
# Count tokens (approximate: words × 1.3)
wc -w SKILL.md

# Or use a proper token counter
# (tokens ≈ characters ÷ 4 for rough estimate)
wc -c SKILL.md
```
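
The characters ÷ 4 rule of thumb can be wrapped in a function and compared against a budget. This is a rough sketch: `estimate_tokens` and `check_budget` are hypothetical names, the estimate uses byte count, and the default budget of 3000 is the SKILL.md target this guide gives.

```bash
# Sketch only: estimate tokens as bytes / 4, rounded up.
estimate_tokens() {
  chars=$(wc -c < "$1" | tr -d '[:space:]')
  echo $(( (chars + 3) / 4 ))
}

# Sketch only: fail when the estimate exceeds the budget (default 3000).
check_budget() {
  tokens=$(estimate_tokens "$1")
  limit=${2:-3000}
  if [ "$tokens" -gt "$limit" ]; then
    echo "$1: ~$tokens tokens (over $limit)"
    return 1
  fi
  echo "$1: ~$tokens tokens"
}
```

Call `check_budget SKILL.md` for the default budget. Treat the number as an approximation; a real tokenizer can differ noticeably, especially for code-heavy files.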

#### Evaluate Split Points

Ask yourself:
- Is content loaded only when needed?
- Could mutually exclusive sections be split?
- Are examples concise but complete?
- Is reference material in separate files?

Target sizes:
- **SKILL.md**: Under 3000 tokens (core workflows)
- **Additional files**: Load only when referenced
- **Total metadata**: ~100 tokens

### 5. Security Validation

#### Automated Checks

```bash
# Check for potential secrets
grep -iE "(password|api[_-]?key|secret|token|credential)" SKILL.md

# Check for hardcoded paths
grep -E "(/Users/|/home/|C:\\\\)" SKILL.md

# Check for sensitive file extensions
grep -E "\.(key|pem|cert|p12|pfx)( |$)" SKILL.md
```

#### Manual Review

Review each file for:
- [ ] No credentials in examples
- [ ] No personal information
- [ ] File paths are generic/relative
- [ ] Network access is documented
- [ ] External dependencies are from trusted sources
- [ ] Scripts don't make unsafe system changes

### 6. Cross-Skill Conflict Testing

If you have multiple skills installed:

1. **Similar domain overlap**: Test that specific skills trigger (not generic ones)
2. **Keyword conflicts**: Check whether multiple skills trigger on the same query
3. **Description clarity**: Ensure each skill's domain is distinct

Example conflicts to avoid:
- "Python Helper" (too generic) vs "Python Testing with pytest" (specific)
- Both trigger on "Help with Python" → Fix by making descriptions more specific
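
Overlapping descriptions are easier to spot side by side. The sketch below assumes one skill per subdirectory, each with a single-line `description:` in its `SKILL.md`; `list_descriptions` is a hypothetical helper name.

```bash
# Sketch only: print "<skill directory>: <description>" for every skill under a root.
list_descriptions() {
  for f in "$1"/*/SKILL.md; do
    [ -f "$f" ] || continue
    skill=$(basename "$(dirname "$f")")
    description=$(sed -n 's/^description:[[:space:]]*//p' "$f" | head -n 1)
    printf '%s: %s\n' "$skill" "$description"
  done
}
```

Run `list_descriptions .` from the repository root and read the output for descriptions that would plausibly match the same request.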

## Testing Workflows

### Quick Test (5 minutes)

For minor updates or simple skills:

1. ✓ Validate metadata (YAML, character limits)
2. ✓ Check one example works
3. ✓ Test one positive trigger
4. ✓ Test one negative trigger
5. ✓ Scan for secrets

### Standard Test (15 minutes)

For new skills or significant changes:

1. ✓ Complete metadata validation
2. ✓ Test all examples
3. ✓ Run 3-5 trigger tests (positive + negative)
4. ✓ Check token efficiency
5. ✓ Full security review
6. ✓ Verify file references

### Comprehensive Test (30+ minutes)

For complex skills or pre-release:

1. ✓ All standard tests
2. ✓ Test with different Claude models
3. ✓ Test conflict scenarios with other skills
4. ✓ Have someone else try the skill
5. ✓ Test edge cases in examples
6. ✓ Review progressive disclosure strategy
7. ✓ Load test (simulate typical usage)

## Common Issues and Fixes

### Skill Doesn't Trigger

**Symptoms**: Claude doesn't load the skill's context when expected

**Diagnose**:
1. Description too vague?
2. Description missing trigger keywords?
3. Name too generic?

**Fix**:
```yaml
# Before
description: Python development helpers

# After
description: Create Python projects using Hatch and Hatchling for dependency management. Use when initializing new Python packages or configuring build systems.
```

### Skill Triggers Too Often

**Symptoms**: Skill loads for unrelated queries

**Diagnose**:
1. Description too broad?
2. Keywords too common?

**Fix**:
```yaml
# Add specificity and exclusions
description: Debug Swift applications using LLDB for crashes, memory issues, and runtime errors. Use when investigating Swift bugs or analyzing app behavior. NOT for general Swift coding or learning.
```

### Examples Don't Work

**Symptoms**: Users can't reproduce examples

**Diagnose**:
1. Missing prerequisites?
2. Placeholders not explained?
3. Environment-specific code?

**Fix**:
- Add a prerequisites section
- Make examples self-contained
- Use generic paths and values

### High Token Usage

**Symptoms**: Skill loads too much content

**Diagnose**:
1. Too much in SKILL.md?
2. No progressive disclosure?
3. Verbose examples?

**Fix**:
- Split reference material into separate files
- Link to external resources
- Condense examples
- Move advanced content to on-demand files

## Automated Testing (Advanced)

For repositories with many skills, consider automation:

### Validate All Skills

```bash
#!/bin/bash
# validate-skills.sh

failed=0

for skill_dir in */; do
  skill_file="${skill_dir}SKILL.md"
  [ -f "$skill_file" ] || continue
  echo "Validating $skill_dir..."
  errors=0

  # Check that the file opens with a YAML frontmatter delimiter
  if [ "$(head -n 1 "$skill_file")" != "---" ]; then
    echo "❌ Missing YAML frontmatter"
    errors=1
  fi

  # Check name length (first "name:" line only)
  name=$(grep -m1 "^name:" "$skill_file" | sed 's/^name: *//')
  if [ ${#name} -gt 64 ]; then
    echo "❌ Name too long: ${#name} chars"
    errors=1
  fi

  # Flag potential secrets for manual review (warning only)
  if grep -qiE "(password|api[_-]?key|secret)" "$skill_file"; then
    echo "⚠️ Potential secrets found"
  fi

  # Report success only when no check failed
  if [ "$errors" -eq 0 ]; then
    echo "✓ $skill_dir validated"
  else
    failed=1
  fi
done

# A non-zero exit status lets callers detect failures
exit "$failed"
```

### CI/CD Integration

Add to GitHub Actions or similar:

```yaml
name: Validate Skills
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run validation
        run: |
          chmod +x validate-skills.sh
          ./validate-skills.sh
```

## Documentation Testing

Ensure documentation is accurate:

1. **Links work**: All markdown links resolve
2. **Paths are correct**: File references are accurate
3. **Examples are current**: Code samples match latest versions
4. **Formatting is consistent**: Markdown renders correctly

```bash
# Check for broken internal links (relative .md targets only)
grep -rHo --include='*.md' ']([^)]*\.md)' . | while IFS=: read -r file link; do
  target=${link#??}; target=${target%?}   # strip the surrounding "](" and ")"
  case $target in http://*|https://*) continue ;; esac
  [ -f "$(dirname "$file")/$target" ] || echo "Broken link in $file: $target"
done
```

## User Acceptance Testing

The ultimate test is real usage:

1. **Give the skill to others**: Have colleagues test it
2. **Monitor usage**: See when it triggers in practice
3. **Gather feedback**: Ask users about clarity and usefulness
4. **Iterate**: Refine based on real-world usage

## Testing Checklist Template

Copy this for each skill you test:

```markdown
# Testing Report: [Skill Name]

Date: [YYYY-MM-DD]
Tester: [Name]

## Metadata
- [ ] YAML valid
- [ ] Name ≤ 64 chars
- [ ] Description ≤ 1024 chars
- [ ] Trigger scenarios in description

## Content
- [ ] "When to Use" section present
- [ ] Examples runnable
- [ ] File references accurate
- [ ] No secrets

## Triggering
Positive tests:
1. [Scenario] - Result: [ ] Pass [ ] Fail
2. [Scenario] - Result: [ ] Pass [ ] Fail

Negative tests:
1. [Scenario] - Result: [ ] Pass [ ] Fail
2. [Scenario] - Result: [ ] Pass [ ] Fail

## Security
- [ ] No credentials
- [ ] No personal data
- [ ] Safe file access
- [ ] Dependencies verified

## Overall
- [ ] Ready for production
- [ ] Needs revision
- [ ] Rejected

Notes:
[Any additional observations]
```

## Resources

- [claude-skills/SKILL.md](./claude-skills/SKILL.md) - Best practices guide
- [claude-skills/checklist.md](./claude-skills/checklist.md) - Quality checklist
- [CONTRIBUTING.md](./CONTRIBUTING.md) - Contribution guidelines

---

**Remember**: Testing isn't just about finding bugs; it's about ensuring your skill provides real value and triggers at the right time.