~ welcome

Yulong

AI Infra · LLM · Agent · Engineer. From low-level GPU / NPU performance to end-to-end agent systems, expanding toward a full-stack LLM & AI Infra craft.

stay kind, stay humble.

~ what i build

Stack × Craft

AI Infra · LLM Serving · Agent Systems · GPU / NPU Performance. From low-level GPU / NPU performance to LLM serving and agent systems — closing the full-stack large-model & AI Infra loop, one release at a time.

full-stack journey (legend: core / shipping · bridging · expanding)

01 Data: corpora · RAG · retrieval eval

02 Pre-train: Megatron-LM · DeepSpeed

03 Fine-tune: SFT · LoRA · DPO

04 Serve: vLLM · SGLang · TRT-LLM

05 Agent: MCP · Claude Code · tools

06 Eval: agent eval · obs · bench

07 Ops: k8s · NCCL · Nsight · perf

~/work · live session 00 · boot

P1 AI Compilers & IR · core

LLVM
MLIR
TVM
Triton
torch.compile
Halide
AscendNPU IR
Bisheng
Multi-level Tiling
Auto-tuning
Dynamic Shape
XLA
IREE

P2 GPU / NPU Kernels · core

CUDA
CUTLASS
cuDNN
FlashAttention
GEMM
Fused Norm
Fused Softmax
Attention Variants
DaVinci Operators
Stencil CUDA
FlashAttention-3

L3 Distributed Training · expanding

PyTorch
NCCL · HCCL
Megatron-LM
DeepSpeed
FSDP / ZeRO
TP / PP / DP
CP / EP / SP
LoRA / QLoRA
RLHF / DPO

L4 LLM Serving · core

vLLM
SGLang
TensorRT-LLM
Triton Inference Server
TGI
Continuous Batching
Paged KV Cache
Speculative Decoding
FP8 / INT4
Operator Fusion
Disaggregated Serving
Multi-modal Inference

L2 Data & Vector · expanding

Spark
Delta Lake
Iceberg
dbt
DVC
Feast
Qdrant
Milvus
pgvector
Hybrid Retrieval

L6 Agent & RAG · expanding

MCP
Claude Code
Function Calling
LangGraph
Planner / ReAct
RAG
Reranker
RAGAS
Agent Eval
FastAPI

L1 · L5 Compute · MLOps · Observability

Linux
C++
Python
Docker
Kubernetes
NVLink / HBM
Nsight Systems
Nsight Compute
perf
Ray
Kubeflow
Slurm
MLflow
Prometheus
Grafana
Kubecost

/ 01

Serving LLMs on heterogeneous hardware

Bringing large-model inference (Llama-class and beyond) onto CUDA and Ascend NPU paths, with operator fusion, FlashAttention, paged KV cache, and speculative decoding.

/ 02

Agent & tooling loops

Agentic workflows over MCP, Claude Code, and custom tool-calling protocols, with reliable planners, evals, and observability for multi-turn agents.

/ 03

Full-stack ambition

Growing from infra-only to the full LLM lifecycle: data, pre-training, fine-tuning, serving, agents, eval, ops. Each release closes one more gap in the loop.