yulong.wang
●
available for work
~ welcome
Yulong
AI Infra · LLM · Agent · Engineer.
From low-level GPU / NPU performance to end-to-end agent systems,
expanding into full-stack LLM and AI infra engineering.
心怀善意,虚怀若谷。
> stay kind, stay humble.
Notes
notes.maxwi.com
GitHub
@blueyi
Email
yl.w@outlook.com
full-stack journey
core / shipping
bridging
expanding
01
Data
corpora · RAG · retrieval eval
02
Pre-train
Megatron-LM · DeepSpeed
03
Fine-tune
SFT · LoRA · DPO
04
Serve
vLLM · SGLang · TRT-LLM
05
Agent
MCP · Claude Code · tools
06
Eval
agent eval · obs · bench
07
Ops
k8s · NCCL · Nsight · perf
~/work · live session
P1 AI Compilers & IR · core
LLVM
MLIR
TVM
Triton
torch.compile
Halide
AscendNPU IR
BiSheng
Multi-level Tiling
Auto-tuning
Dynamic Shape
XLA
IREE
P2 GPU / NPU Kernels · core
CUDA
CUTLASS
cuDNN
FlashAttention
GEMM
Fused Norm
Fused Softmax
Attention Variants
DaVinci Operators
Stencil CUDA
FlashAttention-3
L3 Distributed Training · expanding
PyTorch
NCCL · HCCL
Megatron-LM
DeepSpeed
FSDP / ZeRO
TP / PP / DP
CP / EP / SP
LoRA / QLoRA
RLHF / DPO
L4 LLM Serving · core
vLLM
SGLang
TensorRT-LLM
Triton Inference Server
TGI
Continuous Batching
Paged KV Cache
Speculative Decoding
FP8 / INT4
Operator Fusion
Disaggregated Serving
Multi-modal Inference
L2 Data & Vector · expanding
Spark
Delta Lake
Iceberg
dbt
DVC
Feast
Qdrant
Milvus
pgvector
Hybrid Retrieval
L6 Agent & RAG · expanding
MCP
Claude Code
Function Calling
LangGraph
Planner / ReAct
RAG
Reranker
RAGAS
Agent Eval
FastAPI
L1 · L5 Compute · MLOps · Observability
Linux
C++
Python
Docker
Kubernetes
NVLink / HBM
Nsight Systems
Nsight Compute
perf
Ray
Kubeflow
Slurm
MLflow
Prometheus
Grafana
Kubecost
/ 01
Serving LLMs on heterogeneous hardware
Bringing large-model inference (Llama-class and beyond) to CUDA and Ascend NPU backends via operator fusion, FlashAttention, paged KV cache, and speculative decoding.
/ 02
Agent & tooling loops
Agentic workflows over MCP, Claude Code, and custom tool-calling protocols — reliable planners, evals, and observability for multi-turn agents.
/ 03
Full-stack ambition
Growing from infra-only to the full LLM lifecycle — data, training, fine-tuning, serving, agent, eval, ops. Each release closes one more gap in the loop.
currently building
Multi-agent coding loops on top of MCP & Claude Code.
Faster decode path — TensorRT-LLM + speculative decoding on heterogeneous HW.
Full-stack dive — fine-tuning + RAG + agent eval, end to end on a single laptop.
OSS deep-dives — AscendNPU-IR, vLLM internals, FlashAttention-3.