# AI Infrastructure Agent Skills
> ⚠️ **WARNING**: This project is under active development, and much of its content is LLM-generated without strict proofreading. Use with caution and verify all code before production use.
A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-performance systems.
## Overview
This repository provides expert-level skills for AI infrastructure engineering tasks. Each skill packages domain knowledge, code examples, and best practices to transform Claude into a specialized developer for specific frameworks and tools.
## Construction Methodology (Unless Otherwise Specified)
- **Knowledge Gathering**: Use Gemini DeepResearch to collect comprehensive, up-to-date information on target frameworks
- **Skill Development**: Transform research into structured skills using `skill-creator` in Claude Code
- **Validation**: Test skill-generated code examples to ensure correctness
- **Maintenance**: Update skills regularly based on the latest official documentation
## Available Skills
### TileLang Developer
Write high-performance GPU kernels using TileLang for NVIDIA, AMD, and Ascend hardware.
**Capabilities:**
- Matrix multiplication (GEMM) kernels
- FlashAttention implementations
- DeepSeek MLA operators
- Performance optimization (swizzle layouts, pipelining, warp specialization)
- Cross-platform kernel development
**Status:** ✅ Complete
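To give a flavor of the code this skill produces, below is a minimal tiled-GEMM sketch modeled on TileLang's public examples. Treat it as illustrative rather than authoritative: primitives such as `T.Kernel`, `T.Pipelined`, and `T.gemm` follow the upstream documentation, but exact signatures vary between TileLang versions.

```python
import tilelang
import tilelang.language as T

def matmul(M, N, K, block_M=128, block_N=128, block_K=32):
    # Sketch in the style of TileLang's upstream examples; verify against
    # the TileLang version you have installed before relying on it.
    @T.prim_func
    def gemm(
        A: T.Tensor((M, K), "float16"),
        B: T.Tensor((K, N), "float16"),
        C: T.Tensor((M, N), "float16"),
    ):
        # One thread block per (block_M x block_N) output tile.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), "float16")
            B_shared = T.alloc_shared((block_K, block_N), "float16")
            C_local = T.alloc_fragment((block_M, block_N), "float")  # fp32 accumulator
            T.clear(C_local)
            # Software pipelining overlaps global->shared copies with tensor-core math.
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return gemm
```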
### Megatron Memory Estimator
Estimate GPU memory usage for Megatron-based MoE and dense models. Built upon `megatron_memory_estimator`.
**Capabilities:**
- Estimate memory from HuggingFace configs
- Support for MoE models (DeepSeek-V3, Qwen, etc.)
- Parallelism strategy comparison (TP/PP/EP/CP)
- Memory optimization recommendations
**Status:** ✅ Complete
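As a rough illustration of the arithmetic behind estimates like these (a back-of-envelope sketch, not the estimator's actual algorithm), the dominant static terms for mixed-precision Adam training are weights, gradients, and optimizer states, sharded by the parallelism degrees:

```python
def static_memory_per_gpu_gib(n_params: float, tp: int = 1, pp: int = 1,
                              dp_zero_shards: int = 1) -> float:
    """Illustrative static memory per GPU, assuming bf16 weights/grads and
    fp32 Adam states (master weights + momentum + variance = 12 bytes/param).
    The real estimator also models activations, MoE expert placement, etc."""
    params_per_gpu = n_params / (tp * pp)         # weights sharded across TP x PP ranks
    weights = 2 * params_per_gpu                  # bf16: 2 bytes per parameter
    grads = 2 * params_per_gpu                    # bf16 gradients
    optim = 12 * params_per_gpu / dp_zero_shards  # Adam states, optionally ZeRO-1 sharded
    return (weights + grads + optim) / 1024**3

# e.g. a hypothetical 70B dense model, TP=8, PP=4, optimizer sharded over 4 DP ranks:
print(f"{static_memory_per_gpu_gib(70e9, tp=8, pp=4, dp_zero_shards=4):.1f} GiB")  # ~14.3 GiB
```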
### SLIME User
Guide for using SLIME, an LLM post-training framework for RL scaling. Built upon THUDM/slime.
**Capabilities:**
- RL training setup and configuration (GRPO, GSPO, PPO, Reinforce++)
- Multi-turn tool calling and agent workflows
- Custom reward models and generation functions
- Megatron and FSDP backend configuration
- SGLang integration and optimization
- Dynamic sampling and partial rollout
- Multi-node distributed training
**Status:** ✅ Complete
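Several of the capabilities above (custom rewards, custom generation functions) are user-supplied plugin code. The snippet below is a purely hypothetical sketch of what a rule-based reward hook might look like; slime's actual entry points and signatures live in its source (e.g. `slime/utils/arguments.py` and `slime/rollout/sglang_rollout.py`), so consult those before writing real plugins.

```python
# HYPOTHETICAL reward-function sketch -- `compute_reward` and the way it is
# wired into training are illustrative, not slime's actual plugin API.
import re

def compute_reward(prompt: str, response: str) -> float:
    """Reward rollouts that emit a well-formed <answer>...</answer> block."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0                          # malformed output: no reward
    return 1.0 if match.group(1).strip() else 0.5
```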
Prompt used to create this skill, with Sonnet 4.5:

```text
Use skill-creator to create a skill called slime-user at this repo. slime is an LLM
post-training framework for RL Scaling. Its repo is https://github.com/THUDM/slime.
Skill creation procedure:
1. Git clone the latest repo
2. Analyze `docs/en`, understand basic structure and write a doc navigation guide for
   user getting started or finding docs for advanced usage
3. Gather valuable examples from the docs and `examples` dir, write key ideas and
   script path down for quick reference
4. Checkout some important source code, for example `slime/slime/utils/arguments.py`
   and `slime/rollout/sglang_rollout.py`, provide its path and functions for a quick find.
```
## Planned Skills
### SGLang Developer
Development skill for SGLang (Structured Generation Language) runtime and optimization.
**Planned capabilities:**
- SGLang runtime configuration
- Custom sampling strategies
- Performance tuning for LLM inference
- Multi-GPU serving optimization
**Status:** 🚧 Planned
### vLLM Developer
Skill for vLLM engine development and deployment.
**Planned capabilities:**
- PagedAttention implementation
- Custom scheduler development
- Multi-LoRA serving
- Quantization integration
**Status:** 🚧 Planned
## Usage
### Installing Skills
Skills are installed by placing the skill directory in Claude's skills path:
**Natural Language:**
Ask Claude Code directly: "Help me install skills from https://github.com/yzlnew/infra-skills"
**Personal (across all projects):**

```bash
# Clone and copy to personal skills directory
git clone https://github.com/yzlnew/infra-skills.git
mkdir -p ~/.claude/skills
cp -r infra-skills/tilelang-developer ~/.claude/skills/
cp -r infra-skills/megatron-memory-estimator ~/.claude/skills/
cp -r infra-skills/slime-user ~/.claude/skills/
```
**Project-level (for repository collaborators):**

```bash
# Clone and copy to project's skills directory
cd your-project
git clone https://github.com/yzlnew/infra-skills.git .claude/skills-repo
mkdir -p .claude/skills
cp -r .claude/skills-repo/tilelang-developer .claude/skills/
cp -r .claude/skills-repo/megatron-memory-estimator .claude/skills/
cp -r .claude/skills-repo/slime-user .claude/skills/
```
Skills automatically activate when relevant tasks are detected.
### Examples
**TileLang Kernel Development:**

```text
# User request:
"Write an FP16 matrix multiplication kernel optimized for A100"

# Claude loads tilelang-developer skill and generates:
# - Complete TileLang kernel code
# - Performance optimizations (swizzle, pipelining)
# - Testing code
# - Hardware-specific tuning recommendations
```
**Megatron Memory Estimation:**

```text
# User request:
"Estimate memory for DeepSeek-V3 with TP=8, PP=4, EP=8"

# Claude loads megatron-memory-estimator skill and provides:
# - Detailed memory breakdown (model, optimizer, activations)
# - Comparison across different parallelism strategies
# - Memory optimization recommendations
# - Hardware configuration suggestions
```
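A comparison like the one above boils down to sweeping the parallelism degrees and ranking the configurations that fit. The sketch below is hypothetical and reuses the illustrative formula from the estimator section; the real skill derives its numbers from the HuggingFace config and Megatron's sharding rules.

```python
from itertools import product

def static_memory_per_gpu_gib(n_params, tp=1, pp=1, dp_zero_shards=1):
    # Same illustrative formula as in the estimator sketch above.
    p = n_params / (tp * pp)
    return (2 * p + 2 * p + 12 * p / dp_zero_shards) / 1024**3

def compare_strategies(n_params, gpu_mem_gib=80.0):
    """Rank feasible (TP, PP) pairs, preferring fewer model-parallel ranks."""
    feasible = []
    for tp, pp in product([1, 2, 4, 8], repeat=2):
        mem = static_memory_per_gpu_gib(n_params, tp=tp, pp=pp)
        if mem < 0.9 * gpu_mem_gib:      # keep ~10% headroom for activations
            feasible.append((tp, pp, mem))
    # Fewer model-parallel ranks usually means less communication overhead.
    return sorted(feasible, key=lambda r: r[0] * r[1])

for tp, pp, mem in compare_strategies(70e9):
    print(f"TP={tp} PP={pp}: {mem:.1f} GiB/GPU")
```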
**SLIME RL Training Setup:**

```text
# User request:
"Help me set up GRPO training for Qwen3-4B with multi-turn tool calling"

# Claude loads slime-user skill and provides:
# - Environment setup instructions
# - Custom generation function for tool calling
# - Training script configuration
# - Multi-node scaling guidance
```
## Development
### Testing Skills
Validate code examples in skills:
```bash
# Run all tests from project root
pytest

# Run tests for specific skill
pytest tests/tilelang-developer/

# Run specific test file
pytest tests/tilelang-developer/test_gemm.py
```
### Updating Skills
When frameworks release major updates:
- Update skill source files (SKILL.md, references/) with latest information
- Run validation tests to ensure examples are correct
- Commit and tag new version
## Quality Standards
All skills must meet these criteria:
- ✅ **Accurate**: Code examples must be tested and correct
- ✅ **Concise**: Follow progressive disclosure (SKILL.md < 500 lines)
- ✅ **Complete**: Include workflow, API reference, examples, and debugging
- ✅ **Current**: Based on latest stable framework version
- ✅ **Clear**: Explicit triggers in description for automatic activation
## Contributing
### Skill Requests
Open an issue with:
- Framework/tool name
- Use cases and scenarios
- Link to official documentation
### Skill Improvements
- Fork the repository
- Update skill source files
- Run validation tests
- Submit PR with changelog
## Roadmap
- [x] TileLang developer skill
- [x] Megatron memory estimator skill
- [x] SLIME user skill
- [ ] SGLang developer skill
- [ ] vLLM developer skill
- [ ] Automated testing pipeline
- [ ] Documentation update monitoring
- [ ] Skill versioning system
## License
Skills are provided as-is for development purposes. Generated code follows the license terms of the underlying frameworks.
**Note:** This is a specialized repository for AI infrastructure developers. Skills contain advanced technical content and assume familiarity with GPU programming, compiler design, and deep learning systems.

