
AI Infrastructure Agent Skills

⚠️ WARNING
This project is under active development; much of its content is LLM-generated and has not been rigorously proofread. Use with caution and verify all code before production use.

A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-performance systems.

Overview

This repository provides expert-level skills for AI infrastructure engineering tasks. Each skill packages domain knowledge, code examples, and best practices to transform Claude into a specialized developer for specific frameworks and tools.

Construction Methodology (Unless Otherwise Specified)

  1. Knowledge Gathering: Use Gemini DeepResearch to collect comprehensive, up-to-date information on target frameworks
  2. Skill Development: Transform research into structured skills using skill-creator in Claude Code
  3. Validation: Test skill-generated code examples to ensure correctness
  4. Maintenance: Regular updates based on latest official documentation

Available Skills

TileLang Developer

Write high-performance GPU kernels using TileLang for NVIDIA, AMD, and Ascend hardware.

Capabilities:

  • Matrix multiplication (GEMM) kernels
  • FlashAttention implementations
  • DeepSeek MLA operators
  • Performance optimization (swizzle layouts, pipelining, warp specialization)
  • Cross-platform kernel development

Status: ✅ Complete
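One of the optimizations listed above, swizzled shared-memory layouts, can be illustrated without a GPU. The sketch below uses a common XOR-based swizzle pattern in plain Python; it is an illustration of the idea, not TileLang's actual layout machinery:

```python
def xor_swizzle(row: int, col: int, cols_per_row: int = 8) -> int:
    """Map a (row, col) tile coordinate to a swizzled column index.

    XOR-ing the column with the row spreads a warp's same-column accesses
    across different shared-memory banks, avoiding bank conflicts.
    """
    return col ^ (row % cols_per_row)

# Without swizzling, reading column 0 for rows 0..7 hits the same bank
# 8 times; with the XOR swizzle, the 8 accesses land in 8 distinct banks.
swizzled = [xor_swizzle(r, 0) for r in range(8)]
assert sorted(swizzled) == list(range(8))  # a permutation: one access per bank
```

For a fixed row, the XOR is a bijection over columns, so the swizzle never aliases two elements of the same row into one slot.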

Megatron Memory Estimator

Estimate GPU memory usage for Megatron-based MoE and dense models. Built upon megatron_memory_estimator.

Capabilities:

  • Estimate memory from HuggingFace configs
  • Support for MoE models (DeepSeek-V3, Qwen, etc.)
  • Parallelism strategy comparison (TP/PP/EP/CP)
  • Memory optimization recommendations

Status: ✅ Complete
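The kind of arithmetic this skill automates can be sketched as a back-of-envelope formula. The function below assumes standard mixed-precision Adam accounting (18 bytes per parameter) and ignores activations and the distributed optimizer, so it is a rough illustration rather than the estimator's actual model:

```python
def static_memory_gb(num_params: float, tp: int = 1, pp: int = 1,
                     bytes_per_param: float = 18.0) -> float:
    """Back-of-envelope static memory per GPU (weights + grads + optimizer).

    Assumes mixed-precision Adam: 2 B bf16 weights + 4 B fp32 grads
    + 4 B fp32 master weights + 8 B Adam moments = 18 B per parameter,
    sharded across tensor-parallel (TP) and pipeline-parallel (PP) ranks.
    Activation memory and the distributed optimizer are ignored.
    """
    params_per_gpu = num_params / (tp * pp)
    return params_per_gpu * bytes_per_param / 1024**3

# A 7B dense model with TP=2, PP=2: ~29 GB of static memory per GPU.
print(round(static_memory_gb(7e9, tp=2, pp=2), 1))  # → 29.3
```

Even this crude model makes the trade-off visible: doubling TP or PP halves static memory per GPU, which is why the real estimator compares parallelism strategies side by side.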

SLIME User

A guide to using SLIME, an LLM post-training framework for RL scaling. Built upon THUDM/slime.

Capabilities:

  • RL training setup and configuration (GRPO, GSPO, PPO, Reinforce++)
  • Multi-turn tool calling and agent workflows
  • Custom reward models and generation functions
  • Megatron and FSDP backend configuration
  • SGLang integration and optimization
  • Dynamic sampling and partial rollout
  • Multi-node distributed training

Status: ✅ Complete
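The core idea behind GRPO, one of the algorithms listed above, fits in a few lines: advantages are computed relative to a group of rollouts for the same prompt, so no learned value model is needed. The sketch below illustrates the normalization; it is not SLIME's implementation:

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages, GRPO-style.

    Each prompt gets a group of sampled responses; a response's advantage
    is its reward normalized by the group's mean and standard deviation.
    """
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Rewards for 4 rollouts of one prompt: above-average samples get
# positive advantages, below-average samples negative ones.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the mean is subtracted out, the advantages of each group always sum to zero, which keeps the policy update centered within every prompt's group.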

Prompt used to create this skill, with Sonnet 4.5:

Use skill-creator to create a skill called slime-user at this repo. slime is an LLM
post-training framework for RL Scaling. Its repo is https://github.com/THUDM/slime.

Skill creation procedure:

1. Git clone the latest repo
2. Analyze `docs/en`, understand basic structure and write a doc navigation guide for user
getting started or finding docs for advanced usage
3. Gather valuable examples from the docs and `examples` dir, write key ideas and script
path down for quick reference
4. Checkout some important source code, for example `slime/slime/utils/arguments.py` and
`slime/rollout/sglang_rollout.py`, provide its path and functions for a quick find.

Planned Skills

SGLang Developer

Development skill for SGLang (Structured Generation Language) runtime and optimization.

Planned capabilities:

  • SGLang runtime configuration
  • Custom sampling strategies
  • Performance tuning for LLM inference
  • Multi-GPU serving optimization

Status: 🚧 Planned

vLLM Developer

Skill for vLLM engine development and deployment.

Planned capabilities:

  • PagedAttention implementation
  • Custom scheduler development
  • Multi-LoRA serving
  • Quantization integration

Status: 🚧 Planned
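The first planned capability, PagedAttention, rests on a simple indirection scheme: the KV cache is split into fixed-size blocks, and each sequence maps logical token positions to physical blocks through a block table, so memory is allocated on demand rather than reserved contiguously. A toy sketch of that idea (not vLLM's implementation, and with a trivial stand-in allocator):

```python
class BlockTable:
    """Toy sketch of PagedAttention-style KV-cache paging."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.blocks: list[int] = []  # logical block index -> physical block id
        self._next_free = 0          # stand-in for a real free-block allocator

    def append_token(self, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) for the token at `position`,
        allocating a new physical block when a logical block fills up."""
        logical_block = position // self.block_size
        if logical_block == len(self.blocks):  # first token of a new block
            self.blocks.append(self._next_free)
            self._next_free += 1
        return self.blocks[logical_block], position % self.block_size

table = BlockTable(block_size=16)
# Tokens 0..15 share physical block 0; token 16 triggers a new allocation.
print(table.append_token(0))   # → (0, 0)
print(table.append_token(15))  # → (0, 15)
print(table.append_token(16))  # → (1, 0)
```

In a real engine the physical blocks come from a shared pool across sequences, which is what eliminates the fragmentation of per-sequence contiguous reservations.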

Usage

Installing Skills

Skills are installed by placing the skill directory in Claude's skills path:

Natural Language:
Ask Claude Code directly: "Help me install skills from https://github.com/yzlnew/infra-skills"

Personal (across all projects):

# Clone and copy to personal skills directory
git clone https://github.com/yzlnew/infra-skills.git
mkdir -p ~/.claude/skills
cp -r infra-skills/tilelang-developer ~/.claude/skills/
cp -r infra-skills/megatron-memory-estimator ~/.claude/skills/
cp -r infra-skills/slime-user ~/.claude/skills/

Project-level (for repository collaborators):

# Clone and copy to project's skills directory
cd your-project
git clone https://github.com/yzlnew/infra-skills.git .claude/skills-repo
mkdir -p .claude/skills
cp -r .claude/skills-repo/tilelang-developer .claude/skills/
cp -r .claude/skills-repo/megatron-memory-estimator .claude/skills/
cp -r .claude/skills-repo/slime-user .claude/skills/

Skills automatically activate when relevant tasks are detected.

Examples

TileLang Kernel Development:

# User request:
"Write a FP16 matrix multiplication kernel optimized for A100"

# Claude loads tilelang-developer skill and generates:
# - Complete TileLang kernel code
# - Performance optimizations (swizzle, pipelining)
# - Testing code
# - Hardware-specific tuning recommendations

Megatron Memory Estimation:

# User request:
"Estimate memory for DeepSeek-V3 with TP=8, PP=4, EP=8"

# Claude loads megatron-memory-estimator skill and provides:
# - Detailed memory breakdown (model, optimizer, activations)
# - Comparison across different parallelism strategies
# - Memory optimization recommendations
# - Hardware configuration suggestions

SLIME RL Training Setup:

# User request:
"Help me set up GRPO training for Qwen3-4B with multi-turn tool calling"

# Claude loads slime-user skill and provides:
# - Environment setup instructions
# - Custom generation function for tool calling
# - Training script configuration
# - Multi-node scaling guidance

Development

Testing Skills

Validate code examples in skills:

# Run all tests from project root
pytest

# Run tests for specific skill
pytest tests/tilelang-developer/

# Run specific test file
pytest tests/tilelang-developer/test_gemm.py

Updating Skills

When frameworks release major updates:

  1. Update skill source files (SKILL.md, references/) with latest information
  2. Run validation tests to ensure examples are correct
  3. Commit and tag new version

Quality Standards

All skills must meet these criteria:

  • Accurate: Code examples must be tested and correct
  • Concise: Follow progressive disclosure (SKILL.md < 500 lines)
  • Complete: Include workflow, API reference, examples, and debugging
  • Current: Based on latest stable framework version
  • Clear: Explicit triggers in description for automatic activation

Contributing

Skill Requests

Open an issue with:

  • Framework/tool name
  • Use cases and scenarios
  • Link to official documentation

Skill Improvements

  1. Fork the repository
  2. Update skill source files
  3. Run validation tests
  4. Submit PR with changelog

Roadmap

  • [x] TileLang developer skill
  • [x] Megatron memory estimator skill
  • [x] SLIME user skill
  • [ ] SGLang developer skill
  • [ ] vLLM developer skill
  • [ ] Automated testing pipeline
  • [ ] Documentation update monitoring
  • [ ] Skill versioning system

License

Skills are provided as-is for development purposes. Generated code follows the license terms of the underlying frameworks.


Note: This is a specialized repository for AI infrastructure developers. Skills contain advanced technical content and assume familiarity with GPU programming, compiler design, and deep learning systems.