
AI Infrastructure Agent Skills

⚠️ WARNING
This project is under active development; much of its content is LLM-generated and has not been rigorously proofread. Use with caution and verify all code before production use.

A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-performance systems.

Overview

This repository provides expert-level skills for AI infrastructure engineering tasks. Each skill packages domain knowledge, code examples, and best practices to transform Claude into a specialized developer for specific frameworks and tools.

Construction Methodology (Unless Otherwise Specified)

  1. Knowledge Gathering: Use Gemini DeepResearch to collect comprehensive, up-to-date information on target frameworks
  2. Skill Development: Transform research into structured skills using skill-creator in Claude Code
  3. Validation: Test skill-generated code examples to ensure correctness
  4. Maintenance: Regular updates based on latest official documentation

Available Skills

TileLang Developer

Write high-performance GPU kernels using TileLang for NVIDIA, AMD, and Ascend hardware.

Capabilities:

  • Matrix multiplication (GEMM) kernels
  • FlashAttention implementations
  • DeepSeek MLA operators
  • Performance optimization (swizzle layouts, pipelining, warp specialization)
  • Cross-platform kernel development

Status: ✅ Complete
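One of the optimizations listed above, swizzled shared-memory layouts, can be illustrated without a GPU. The sketch below uses a common XOR-based swizzle pattern in plain Python; it is an illustration of the idea, not TileLang's actual layout machinery:

```python
def xor_swizzle(row: int, col: int, cols_per_row: int = 8) -> int:
    """Map a (row, col) tile coordinate to a swizzled column index.

    XOR-ing the column with the row spreads a warp's same-column accesses
    across different shared-memory banks, avoiding bank conflicts.
    """
    return col ^ (row % cols_per_row)

# Without swizzling, reading column 0 for rows 0..7 hits the same bank
# 8 times; with the XOR swizzle, the 8 accesses land in 8 distinct banks.
swizzled = [xor_swizzle(r, 0) for r in range(8)]
assert sorted(swizzled) == list(range(8))  # a permutation: one access per bank
```

For a fixed row, the XOR is a bijection over columns, so the swizzle never aliases two elements of the same row into one slot.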

Megatron Memory Estimator

Estimate GPU memory usage for Megatron-based MoE and dense models. Built upon megatron_memory_estimator.

Capabilities:

  • Estimate memory from HuggingFace configs
  • Support for MoE models (DeepSeek-V3, Qwen, etc.)
  • Parallelism strategy comparison (TP/PP/EP/CP)
  • Memory optimization recommendations

Status: ✅ Complete
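The kind of arithmetic this skill automates can be sketched as a back-of-envelope formula. The function below assumes standard mixed-precision Adam accounting (18 bytes per parameter) and ignores activations and the distributed optimizer, so it is a rough illustration rather than the estimator's actual model:

```python
def static_memory_gb(num_params: float, tp: int = 1, pp: int = 1,
                     bytes_per_param: float = 18.0) -> float:
    """Back-of-envelope static memory per GPU (weights + grads + optimizer).

    Assumes mixed-precision Adam: 2 B bf16 weights + 4 B fp32 grads
    + 4 B fp32 master weights + 8 B Adam moments = 18 B per parameter,
    sharded across tensor-parallel (TP) and pipeline-parallel (PP) ranks.
    Activation memory and the distributed optimizer are ignored.
    """
    params_per_gpu = num_params / (tp * pp)
    return params_per_gpu * bytes_per_param / 1024**3

# A 7B dense model with TP=2, PP=2: ~29 GB of static memory per GPU.
print(round(static_memory_gb(7e9, tp=2, pp=2), 1))  # → 29.3
```

Even this crude model makes the trade-off visible: doubling TP or PP halves static memory per GPU, which is why the real estimator compares parallelism strategies side by side.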

SLIME User

A guide to using SLIME, an LLM post-training framework for RL scaling. Built upon THUDM/slime.

Capabilities:

  • RL training setup and configuration (GRPO, GSPO, PPO, Reinforce++)
  • Multi-turn tool calling and agent workflows
  • Custom reward models and generation functions
  • Megatron and FSDP backend configuration
  • SGLang integration and optimization
  • Dynamic sampling and partial rollout
  • Multi-node distributed training

Status: ✅ Complete
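The core idea behind GRPO, one of the algorithms listed above, fits in a few lines: advantages are computed relative to a group of rollouts for the same prompt, so no learned value model is needed. The sketch below illustrates the normalization; it is not SLIME's implementation:

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages, GRPO-style.

    Each prompt gets a group of sampled responses; a response's advantage
    is its reward normalized by the group's mean and standard deviation.
    """
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Rewards for 4 rollouts of one prompt: above-average samples get
# positive advantages, below-average samples negative ones.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the mean is subtracted out, the advantages of each group always sum to zero, which keeps the policy update centered within every prompt's group.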

Prompt used to create this skill, with Sonnet 4.5:

Use skill-creator to create a skill called slime-user at this repo. slime is an LLM
post-training framework for RL Scaling. Its repo is https://github.com/THUDM/slime.

Skill creation procedure:

1. Git clone the latest repo
2. Analyze `docs/en`, understand basic structure and write a doc navigation guide for user
getting started or finding docs for advanced usage
3. Gather valuable examples from the docs and `examples` dir, write key ideas and script
path down for quick reference
4. Checkout some important source code, for example `slime/slime/utils/arguments.py` and
`slime/rollout/sglang_rollout.py`, provide its path and functions for a quick find.

Planned Skills

SGLang Developer

Development skill for SGLang (Structured Generation Language) runtime and optimization.

Planned capabilities:

  • SGLang runtime configuration
  • Custom sampling strategies
  • Performance tuning for LLM inference
  • Multi-GPU serving optimization

Status: 🚧 Planned

vLLM Developer

Skill for vLLM engine development and deployment.

Planned capabilities:

  • PagedAttention implementation
  • Custom scheduler development
  • Multi-LoRA serving
  • Quantization integration

Status: 🚧 Planned
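The first planned capability, PagedAttention, rests on a simple indirection scheme: the KV cache is split into fixed-size blocks, and each sequence maps logical token positions to physical blocks through a block table, so memory is allocated on demand rather than reserved contiguously. A toy sketch of that idea (not vLLM's implementation, and with a trivial stand-in allocator):

```python
class BlockTable:
    """Toy sketch of PagedAttention-style KV-cache paging."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.blocks: list[int] = []  # logical block index -> physical block id
        self._next_free = 0          # stand-in for a real free-block allocator

    def append_token(self, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) for the token at `position`,
        allocating a new physical block when a logical block fills up."""
        logical_block = position // self.block_size
        if logical_block == len(self.blocks):  # first token of a new block
            self.blocks.append(self._next_free)
            self._next_free += 1
        return self.blocks[logical_block], position % self.block_size

table = BlockTable(block_size=16)
# Tokens 0..15 share physical block 0; token 16 triggers a new allocation.
print(table.append_token(0))   # → (0, 0)
print(table.append_token(15))  # → (0, 15)
print(table.append_token(16))  # → (1, 0)
```

In a real engine the physical blocks come from a shared pool across sequences, which is what eliminates the fragmentation of per-sequence contiguous reservations.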

Usage

Installing Skills

Skills are installed by placing the skill directory in Claude's skills path:

Natural Language:
Ask Claude Code directly: "Help me install skills from https://github.com/yzlnew/infra-skills"

Personal (across all projects):

# Clone and copy to personal skills directory
git clone https://github.com/yzlnew/infra-skills.git
mkdir -p ~/.claude/skills
cp -r infra-skills/tilelang-developer ~/.claude/skills/
cp -r infra-skills/megatron-memory-estimator ~/.claude/skills/
cp -r infra-skills/slime-user ~/.claude/skills/

Project-level (for repository collaborators):

# Clone and copy to project's skills directory
cd your-project
git clone https://github.com/yzlnew/infra-skills.git .claude/skills-repo
mkdir -p .claude/skills
cp -r .claude/skills-repo/tilelang-developer .claude/skills/
cp -r .claude/skills-repo/megatron-memory-estimator .claude/skills/
cp -r .claude/skills-repo/slime-user .claude/skills/

Skills automatically activate when relevant tasks are detected.

Examples

TileLang Kernel Development:

# User request:
"Write a FP16 matrix multiplication kernel optimized for A100"

# Claude loads tilelang-developer skill and generates:
# - Complete TileLang kernel code
# - Performance optimizations (swizzle, pipelining)
# - Testing code
# - Hardware-specific tuning recommendations

Megatron Memory Estimation:

# User request:
"Estimate memory for DeepSeek-V3 with TP=8, PP=4, EP=8"

# Claude loads megatron-memory-estimator skill and provides:
# - Detailed memory breakdown (model, optimizer, activations)
# - Comparison across different parallelism strategies
# - Memory optimization recommendations
# - Hardware configuration suggestions

SLIME RL Training Setup:

# User request:
"Help me set up GRPO training for Qwen3-4B with multi-turn tool calling"

# Claude loads slime-user skill and provides:
# - Environment setup instructions
# - Custom generation function for tool calling
# - Training script configuration
# - Multi-node scaling guidance

Development

Testing Skills

Validate code examples in skills:

# Run all tests from project root
pytest

# Run tests for specific skill
pytest tests/tilelang-developer/

# Run specific test file
pytest tests/tilelang-developer/test_gemm.py

Updating Skills

When frameworks release major updates:

  1. Update skill source files (SKILL.md, references/) with latest information
  2. Run validation tests to ensure examples are correct
  3. Commit and tag new version

Quality Standards

All skills must meet these criteria:

  • Accurate: Code examples must be tested and correct
  • Concise: Follow progressive disclosure (SKILL.md < 500 lines)
  • Complete: Include workflow, API reference, examples, and debugging
  • Current: Based on latest stable framework version
  • Clear: Explicit triggers in description for automatic activation

Contributing

Skill Requests

Open an issue with:

  • Framework/tool name
  • Use cases and scenarios
  • Link to official documentation

Skill Improvements

  1. Fork the repository
  2. Update skill source files
  3. Run validation tests
  4. Submit PR with changelog

Roadmap

  • [x] TileLang developer skill
  • [x] Megatron memory estimator skill
  • [x] SLIME user skill
  • [ ] SGLang developer skill
  • [ ] vLLM developer skill
  • [ ] Automated testing pipeline
  • [ ] Documentation update monitoring
  • [ ] Skill versioning system

License

Skills are provided as-is for development purposes. Generated code follows the license terms of the underlying frameworks.


Note: This is a specialized repository for AI infrastructure developers. Skills contain advanced technical content and assume familiarity with GPU programming, compiler design, and deep learning systems.