# Testing and Validating Skills

This guide helps you validate skills before adding them to the repository or using them in production.

## Quick Validation Checklist

Run through this checklist before submitting a skill:

### Metadata

- [ ] SKILL.md exists
- [ ] YAML frontmatter is valid
- [ ] Name ≤ 64 characters
- [ ] Description ≤ 1024 characters
- [ ] Description includes trigger scenarios

### Content Quality

- [ ] "When to Use This Skill" section present
- [ ] At least one concrete example
- [ ] Examples are runnable/testable
- [ ] File references are accurate
- [ ] No sensitive data hardcoded

### Triggering Tests

- [ ] Triggers on target scenarios
- [ ] Doesn't trigger on unrelated scenarios
- [ ] No conflicts with similar skills

### Security

- [ ] No credentials or API keys
- [ ] No personal information
- [ ] Safe file system access only
- [ ] External dependencies verified

## Detailed Testing Process

### 1. Metadata Validation

#### Test YAML Parsing

Try parsing the frontmatter:

```bash
# Extract the YAML block between the opening and closing --- markers
sed -n '/^---$/,/^---$/p' SKILL.md
```

Verify:

- YAML is valid (no syntax errors)
- Both `name` and `description` are present
- Values are within character limits
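
These checks can also be scripted. A minimal sketch, demonstrated against a hypothetical sample file (point it at your real SKILL.md instead):

```shell
# Build a hypothetical sample SKILL.md to demonstrate against
cat > /tmp/sample-SKILL.md <<'EOF'
---
name: pytest-testing
description: Write and run Python unit tests with pytest. Use when creating or fixing tests.
---
# Python Testing with pytest
EOF

# Pull out only the lines between the first pair of --- markers
fm=$(awk '/^---$/{n++; next} n==1{print} n>=2{exit}' /tmp/sample-SKILL.md)

# Confirm both required fields are present
echo "$fm" | grep -q '^name:' && echo "name: present" || echo "name: MISSING"
echo "$fm" | grep -q '^description:' && echo "description: present" || echo "description: MISSING"
```

This catches the most common frontmatter mistakes (missing field, missing delimiter) without needing a YAML parser installed.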

#### Character Limits

```bash
# Count characters in name (must be ≤ 64); tr -d '\n' avoids counting the trailing newline
grep "^name:" SKILL.md | sed 's/name: //' | tr -d '\n' | wc -c

# Count characters in description (must be ≤ 1024)
grep "^description:" SKILL.md | sed 's/description: //' | tr -d '\n' | wc -c
```
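
To turn the counts into a pass/fail check, here is a small sketch (again run against a hypothetical sample file; substitute your real SKILL.md):

```shell
# Create a hypothetical sample SKILL.md to check
cat > /tmp/sample-SKILL.md <<'EOF'
---
name: pytest-testing
description: Write and run Python unit tests with pytest.
---
EOF

# Measure field lengths without the trailing newline
name_len=$(grep '^name:' /tmp/sample-SKILL.md | sed 's/^name: *//' | tr -d '\n' | wc -c)
desc_len=$(grep '^description:' /tmp/sample-SKILL.md | sed 's/^description: *//' | tr -d '\n' | wc -c)

[ "$name_len" -le 64 ] && echo "name OK ($name_len chars)" || echo "name too long ($name_len chars)"
[ "$desc_len" -le 1024 ] && echo "description OK ($desc_len chars)" || echo "description too long ($desc_len chars)"
```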

### 2. Content Quality Testing

#### Check Required Sections

```bash
# Verify a "When to Use This Skill" section exists
grep -i "when to use" SKILL.md

# Verify examples exist
grep -i "example" SKILL.md
```

#### Test File References

If the skill references other files, verify they exist:

```bash
# Find markdown links
grep -o '\[.*\]([^)]*\.md)' SKILL.md

# Check that each referenced file exists
# (manually verify each one)
```

#### Validate Examples

For each example in the skill:

1. Try running the code/commands
2. Verify output matches expectations
3. Check for edge cases
4. Ensure examples are complete (no placeholders)

### 3. Trigger Testing

This is the most important validation step.

#### Create Test Scenarios

**Positive tests (SHOULD trigger)**

Create a list of scenarios where the skill should activate:

```
Test Scenario 1: [Describe task that should trigger]
Expected: Skill activates
Actual: [Test result]

Test Scenario 2: [Another trigger case]
Expected: Skill activates
Actual: [Test result]
```

**Negative tests (SHOULD NOT trigger)**

Create scenarios where the skill should NOT activate:

```
Test Scenario 3: [Similar but different task]
Expected: Skill does NOT activate
Actual: [Test result]

Test Scenario 4: [Unrelated task]
Expected: Skill does NOT activate
Actual: [Test result]
```

#### Example Testing Session

For a "Python Testing with pytest" skill:

**Should trigger:**

- "Help me write tests for my Python function"
- "How do I use pytest fixtures?"
- "Create unit tests for this class"

**Should NOT trigger:**

- "Help me test my JavaScript code" (different language)
- "Debug my pytest installation" (installation, not testing)
- "Explain what unit testing is" (concept, not implementation)

#### Run Tests with Claude

1. Load the skill
2. Ask Claude each test question
3. Observe whether the skill triggers (check the response for skill context)
4. Document results

### 4. Token Efficiency Testing

#### Measure Content Size

```bash
# Count words (tokens ≈ words × 1.3 for a rough estimate)
wc -w SKILL.md

# Or count characters (tokens ≈ characters ÷ 4 for a rough estimate)
wc -c SKILL.md
```

#### Evaluate Split Points

Ask yourself:

- Is content loaded only when needed?
- Could mutually exclusive sections be split?
- Are examples concise but complete?
- Is reference material in separate files?

Target sizes:

- SKILL.md: under 3000 tokens (core workflows)
- Additional files: loaded only when referenced
- Total metadata: ~100 tokens
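
A quick budget guard based on the characters ÷ 4 estimate can be sketched as follows (shown against a hypothetical sample file; the 3000 figure is the SKILL.md target above):

```shell
# Hypothetical sample file; substitute your real SKILL.md
cat > /tmp/sample-SKILL.md <<'EOF'
---
name: demo
description: Demo skill for a size check.
---
A short body well under budget.
EOF

# Rough token estimate: characters divided by 4
est=$(( $(wc -c < /tmp/sample-SKILL.md) / 4 ))
if [ "$est" -gt 3000 ]; then
    echo "over budget: ~$est tokens (consider splitting content)"
else
    echo "within budget: ~$est tokens"
fi
```

This is only a heuristic; an actual tokenizer will give different numbers, but it is enough to flag a file that has clearly grown too large.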

### 5. Security Validation

#### Automated Checks

```bash
# Check for potential secrets
grep -iE "(password|api[_-]?key|secret|token|credential)" SKILL.md

# Check for hardcoded paths
grep -E "(/Users/|/home/|C:\\\\)" SKILL.md

# Check for sensitive file extensions
grep -E "\.(key|pem|cert|p12|pfx)( |$)" SKILL.md
```

#### Manual Review

Review each file for:

- No credentials in examples
- No personal information
- File paths are generic/relative
- Network access is documented
- External dependencies are from trusted sources
- Scripts don't make unsafe system changes
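
The automated checks above can be combined into a single pass that reports a total finding count. A sketch, demonstrated on a hypothetical clean sample file:

```shell
# Hypothetical sample file; run this against your real skill files
cat > /tmp/sample-SKILL.md <<'EOF'
---
name: demo
description: Demo skill with nothing risky in it.
---
Use relative paths like ./scripts/run.sh in examples.
EOF

# Sum matching lines across the secret, hardcoded-path, and extension patterns
findings=0
for pattern in 'password|api[_-]?key|secret|token|credential' '/Users/|/home/|C:\\' '\.(key|pem|cert|p12|pfx)( |$)'; do
    hits=$(grep -icE "$pattern" /tmp/sample-SKILL.md)
    findings=$((findings + hits))
done
echo "potential findings: $findings"
```

A non-zero count is a prompt for manual review, not proof of a leak — the patterns will also match legitimate prose such as "never commit your API key".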

### 6. Cross-Skill Conflict Testing

If you have multiple skills installed:

1. **Similar domain overlap:** Test that specific skills trigger (not generic ones)
2. **Keyword conflicts:** Check whether multiple skills trigger on the same query
3. **Description clarity:** Ensure each skill's domain is distinct

Example conflicts to avoid:

- "Python Helper" (too generic) vs. "Python Testing with pytest" (specific)
- Both trigger on "Help with Python" → fix by making the descriptions more specific
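
One quick way to spot overlap is to print every installed skill's description side by side. A sketch using two hypothetical skill directories:

```shell
# Two hypothetical skill directories to compare
mkdir -p /tmp/skills/python-helper /tmp/skills/pytest-testing
printf -- '---\nname: python-helper\ndescription: General Python development helpers.\n---\n' \
    > /tmp/skills/python-helper/SKILL.md
printf -- '---\nname: pytest-testing\ndescription: Python testing with pytest fixtures and parametrize.\n---\n' \
    > /tmp/skills/pytest-testing/SKILL.md

# Side-by-side view: overlapping keywords (here "Python") stand out immediately
grep -H '^description:' /tmp/skills/*/SKILL.md
```

If two descriptions share their most distinctive keywords, tighten each one until its domain is unambiguous.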

## Testing Workflows

### Quick Test (5 minutes)

For minor updates or simple skills:

1. ✓ Validate metadata (YAML, character limits)
2. ✓ Check one example works
3. ✓ Test one positive trigger
4. ✓ Test one negative trigger
5. ✓ Scan for secrets

### Standard Test (15 minutes)

For new skills or significant changes:

1. ✓ Complete metadata validation
2. ✓ Test all examples
3. ✓ Run 3-5 trigger tests (positive + negative)
4. ✓ Check token efficiency
5. ✓ Full security review
6. ✓ Verify file references

### Comprehensive Test (30+ minutes)

For complex skills or pre-release:

1. ✓ All standard tests
2. ✓ Test with different Claude models
3. ✓ Test conflict scenarios with other skills
4. ✓ Have someone else try the skill
5. ✓ Test edge cases in examples
6. ✓ Review progressive disclosure strategy
7. ✓ Load test (simulate typical usage)

## Common Issues and Fixes

### Skill Doesn't Trigger

**Symptoms:** Claude doesn't load the skill context when expected

**Diagnose:**

1. Description too vague?
2. Description missing trigger keywords?
3. Name too generic?

**Fix:**

```yaml
# Before
description: Python development helpers

# After
description: Create Python projects using Hatch and Hatchling for dependency management. Use when initializing new Python packages or configuring build systems.
```

### Skill Triggers Too Often

**Symptoms:** Skill loads for unrelated queries

**Diagnose:**

1. Description too broad?
2. Keywords too common?

**Fix:**

```yaml
# Add specificity and exclusions
description: Debug Swift applications using LLDB for crashes, memory issues, and runtime errors. Use when investigating Swift bugs or analyzing app behavior. NOT for general Swift coding or learning.
```

### Examples Don't Work

**Symptoms:** Users can't reproduce examples

**Diagnose:**

1. Missing prerequisites?
2. Placeholders not explained?
3. Environment-specific code?

**Fix:**

- Add a prerequisites section
- Make examples self-contained
- Use generic paths and values

### High Token Usage

**Symptoms:** Skill loads too much content

**Diagnose:**

1. Too much in SKILL.md?
2. No progressive disclosure?
3. Verbose examples?

**Fix:**

- Split reference material into separate files
- Link to external resources
- Condense examples
- Move advanced content to on-demand files

## Automated Testing (Advanced)

For repositories with many skills, consider automation:

### Validate All Skills

```bash
#!/bin/bash
# validate-skills.sh

for skill_dir in */; do
    if [ -f "$skill_dir/SKILL.md" ]; then
        echo "Validating $skill_dir..."

        # Check frontmatter exists
        if ! grep -q "^---$" "$skill_dir/SKILL.md"; then
            echo "❌ Missing YAML frontmatter"
        fi

        # Check name length
        name=$(grep "^name:" "$skill_dir/SKILL.md" | sed 's/name: //')
        if [ ${#name} -gt 64 ]; then
            echo "❌ Name too long: ${#name} chars"
        fi

        # Check for secrets
        if grep -qiE "(password|api[_-]?key|secret)" "$skill_dir/SKILL.md"; then
            echo "⚠️  Potential secrets found"
        fi

        echo "✓ $skill_dir validated"
    fi
done
```

### CI/CD Integration

Add to GitHub Actions or similar:

```yaml
name: Validate Skills
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run validation
        run: |
          chmod +x validate-skills.sh
          ./validate-skills.sh
```

## Documentation Testing

Ensure documentation is accurate:

1. **Links work:** All markdown links resolve
2. **Paths are correct:** File references are accurate
3. **Examples are current:** Code samples match the latest versions
4. **Formatting is consistent:** Markdown renders correctly

```bash
# Check for broken internal links (assumes link targets are relative paths)
grep -rHo '\[[^]]*\]([^)]*\.md)' --include='*.md' . | while IFS=: read -r file match; do
    target=${match#*(}   # strip everything through the opening parenthesis
    target=${target%)}   # strip the closing parenthesis
    if [ ! -f "$(dirname "$file")/$target" ]; then
        echo "Broken link in $file: $target"
    fi
done
```

## User Acceptance Testing

The ultimate test is real usage:

1. **Give the skill to others:** Have colleagues test it
2. **Monitor usage:** See when it triggers in practice
3. **Gather feedback:** Ask users about clarity and usefulness
4. **Iterate:** Refine based on real-world usage

## Testing Checklist Template

Copy this for each skill you test:

```markdown
# Testing Report: [Skill Name]

Date: [YYYY-MM-DD]
Tester: [Name]

## Metadata
- [ ] YAML valid
- [ ] Name ≤ 64 chars
- [ ] Description ≤ 1024 chars
- [ ] Trigger scenarios in description

## Content
- [ ] "When to Use" section present
- [ ] Examples runnable
- [ ] File references accurate
- [ ] No secrets

## Triggering
Positive tests:
1. [Scenario] - Result: [ ] Pass [ ] Fail
2. [Scenario] - Result: [ ] Pass [ ] Fail

Negative tests:
1. [Scenario] - Result: [ ] Pass [ ] Fail
2. [Scenario] - Result: [ ] Pass [ ] Fail

## Security
- [ ] No credentials
- [ ] No personal data
- [ ] Safe file access
- [ ] Dependencies verified

## Overall
- [ ] Ready for production
- [ ] Needs revision
- [ ] Rejected

Notes:
[Any additional observations]
```

Remember: Testing isn't just about finding bugs—it's about ensuring your skill provides real value and triggers at the right time.