Fine-tuned 7B models now outperform zero-shot GPT-4 on knowledge graph completion, while a critical gap exists in benchmarks for dynamic ontology mutation—the very task most needed for continuous edge rearrangement in production systems. This comprehensive survey maps 47 benchmarks across relation extraction, link prediction, schema evolution, and multi-hop reasoning, revealing that optimal model selection depends heavily on the specific graph operation required.
The landscape divides into mature benchmark ecosystems (relation extraction, link prediction) and severely under-benchmarked areas (dynamic schema evolution). For practitioners building knowledge graph systems requiring low latency and high efficiency, Qwen-2.5-7B and Mistral-7B at Q4_K_M quantization offer the best Pareto frontier positions, achieving 85%+ of frontier model accuracy at ~4GB VRAM and 90+ tokens/second.
Relation extraction benchmarks reveal annotation quality crisis
Document-level and sentence-level relation extraction form the foundation of graph population from unstructured text. The benchmark landscape has matured significantly, but annotation quality issues have forced the community toward revised datasets.
DocRED remains the primary document-level benchmark with 5,053 Wikipedia documents across 96 Wikidata relations, requiring reading multiple sentences to identify relationships. However, its substantial false negative rate led to Re-DocRED, which shows models gaining ~13 F1 points when trained on corrected annotations. Current state-of-the-art (DREEAM) achieves ~82 F1 on Re-DocRED, while direct LLM fine-tuning produces suboptimal results—hybrid approaches combining LLaMA-2 with relation classifiers (LMRC) significantly outperform pure LLM methods.
TACRED (106,264 sentence-level examples, 42 relation types) suffers from ~8% label errors affecting absolute F1 scores. Its revised versions—TACREV and Re-TACRED—now serve as preferred evaluation targets. On Re-TACRED, specialized models reach 91.6 F1, while RAG-enhanced approaches using Flan T5-XL achieve 86.6 F1 on original TACRED.
| Benchmark | Scale | Relations | Best Model | F1 Score |
|---|---|---|---|---|
| Re-DocRED | 5K docs | 96 | DREEAM | ~82 |
| Re-TACRED | 106K sent | Redesigned | CTL-DRP | 91.6 |
| TACREV | 106K sent | 42 | RAG4RE (Flan T5-XL) | 88.3 |
| FewRel (5-way 5-shot) | 70K sent | 100 | BERT-PAIR | ~98% acc |
| SemEval-2010 Task 8 | 10.7K sent | 10 | Entity-Centric DT | 90.5 |
Small model performance varies dramatically by approach. Mistral-7B with task-incremental fine-tuning achieves 95.8% accuracy on seen TACRED tasks, while Flan T5 with retrieval augmentation (RAG4RE) delivers the best LLM-based results. Zero-shot LLMs still trail specialized architectures by 10-20 F1 points on document-level extraction, making fine-tuning essential for production deployment.
Link prediction benchmarks show LLMs catching up to embeddings
Knowledge graph completion benchmarks have evolved from simple Freebase subsets to million-scale Wikidata evaluations. The key insight: fine-tuned LLaMA-7B now outperforms ChatGPT and matches GPT-4 on triple classification, fundamentally changing the cost-performance calculus.
FB15k-237 (14,541 entities, 237 relations) and WN18RR (40,943 entities, 11 relations) remain standard transductive benchmarks. Recent advances like KG-FIT (NeurIPS 2024) and SR-GNN achieve +10% MRR improvement over baselines through combining LLM-refined hierarchies with graph embeddings.
For large-scale evaluation, ogbl-wikikg2 (2.5 million entities, 17 million edges) tests scalability—the current SOTA (RelEns) achieves 0.739 MRR, while traditional embeddings like RotatE plateau at 0.433 MRR. This 70% improvement from ensemble methods underscores the importance of hybrid approaches.
The Inductive Link Prediction Challenge (ILPC 2022) tests zero-shot generalization to entirely new entities—critical for dynamic knowledge graphs. Baseline performance remains modest (MRR ~0.14), but GPT-4 with ontology augmentation reaches 0.152 Hits@1, suggesting LLMs offer unique inductive capabilities that traditional embeddings lack.
KG-LLM Framework results demonstrate the fine-tuning opportunity:
| Model | Triple Classification (WN11/FB13) | Relation Prediction (YAGO3-10) |
|---|---|---|
| KG-LLaMA2-13B (fine-tuned) | 96.6% / 90.7% | 70.28% Hits@1 |
| GPT-4 (zero-shot) | ~94% (100-sample) | 56% Hits@1 |
| ChatGPT (zero-shot) | ~90% (100-sample) | 39% Hits@1 |
| LLaMA-7B (zero-shot) | 21.1% / 9.1% | — |
The 6-7x improvement from fine-tuning LLaMA-7B versus zero-shot—combined with dramatically lower inference costs—makes specialized fine-tuning the clear winner for production link prediction systems.
Dynamic ontology benchmarks face a critical gap
No dedicated benchmarks exist for testing real-time schema evolution handling—the most significant finding of this survey. While robust evaluation exists for static ontology tasks, the benchmark ecosystem fails to address how models adapt when ontologies change during operation.
Existing temporal knowledge graph benchmarks (ICEWS, GDELT, YAGO temporal variants) test temporal facts but not temporal changes to the schema itself. Models achieve ~45% MRR on ICEWS cross-dataset evaluation and ~63% MRR on YAGO temporal subsets, but these measure temporal reasoning over static schemas.
Structured output benchmarks partially fill the gap. JSONSchemaBench (January 2025) evaluates constrained decoding across ~10,000 real-world JSON schemas in 10 categories. Best frameworks support 2x more schemas than worst performers, and constrained decoding improves task accuracy by up to 4%. However, this tests schema adherence, not schema evolution.
OAEI (Ontology Alignment Evaluation Initiative) provides the closest analog to schema evolution testing through ontology matching tracks:
- Anatomy track: 2,744 vs 3,304 class alignments
- Bio-ML track: Machine learning-friendly biomedical matching
- Complex Correspondences: Beyond simple 1:1 mappings
The newest benchmark, OntoURL (May 2025), evaluates LLM ontological understanding across 58,981 questions from 40 ontologies. Key finding: LLMs demonstrate strong understanding but struggle with reasoning and learning tasks—precisely the capabilities needed for dynamic ontology mutation.
Missing benchmark capabilities for edge rearrangement:
- Incremental ontology update handling (adding/removing/modifying concepts)
- Schema drift detection and awareness
- Backward compatibility with old data under new schemas
- Cross-version reasoning across multiple ontology states
- Real-time schema mutation during inference
Multi-hop reasoning separates production-ready from prototype models
Graph traversal benchmarks reveal that complexity scaling remains the primary challenge—performance drops from ~70% F1 at 2 hops to ~15% at 4+ hops, even for frontier models.
HotpotQA (112,779 Wikipedia Q&A pairs requiring 2+ hop reasoning) shows GPT-4o achieving ~70-80% F1 on bridge questions but degrading to ~15-25% F1 on 4-hop compositional questions. The human-machine gap remains substantial on complex reasoning.
MetaQA (407,000 questions across 1-3 hop subsets in the movie domain) demonstrates near-solved status: KnowledgeNavigator + LLaMA-2-70B achieves 99.5% on 2-hop tasks. However, this domain-constrained success doesn’t generalize.
GrailQA (64,331 Freebase questions with zero-shot generalization testing) provides the most rigorous evaluation: SOTA achieves 84.4% F1 overall, but zero-shot performance drops to 80.8% F1. The compositional generalization gap (i.i.d. vs zero-shot) measures true reasoning capability.
| Benchmark | Size | Hops | SOTA Performance | GPT-4 Range |
|---|---|---|---|---|
| HotpotQA | 113K | 2+ | 80%+ (specialized) | 70-80% F1 |
| MetaQA 2-hop | 407K | 2 | 99.5% | ~99% |
| MetaQA 3-hop | 407K | 3 | 95%+ | ~95% |
| GrailQA | 64K | 1-4 | 84.4% F1 | Not benchmarked |
| KQA Pro | 120K | Multi | 95.3% (vs 97.5% human) | — |
| CLUTRR | Variable | 2-10+ | GAT >> BERT | — |
GNN-RAG hybrid approaches represent a breakthrough: combining GNN retrieval with a 7B LLM achieves +14.5% Hits improvement over ToG+ChatGPT while using far fewer resources. This architecture—GNN for candidate retrieval, LLM for final generation—offers the optimal efficiency profile for production multi-hop systems.
Efficiency metrics define the deployment frontier
Memory bandwidth, not compute, determines inference speed. The H100 delivers 2.8x throughput over A100 at 1.7x cost increase, while quantization at Q4_K_M reduces memory 4x with minimal quality degradation.
Generation speed benchmarks (tokens/second on A100):
| Model | TensorRT-LLM | vLLM | llama.cpp |
|---|---|---|---|
| Mistral 7B | 93.6 | 89.7 | ~45 |
| LLaMA-2 7B | 92.2 | 89.7 | ~45 |
| Gemma 7B | 65.9 (TGI) | — | ~35 |
| LLaMA-2 13B | 52.6 | 49.2 | ~25 |
Quantization performance (GGUF on 7B models):
| Quantization | VRAM | Quality Loss | Speed Impact |
|---|---|---|---|
| Q8_0 | ~8 GB | Minimal | Baseline |
| Q6_K | ~6.5 GB | Negligible | ~10% faster |
| Q4_K_M | ~4.5 GB | <2% | ~25% faster |
| Q3_K_M | ~3.5 GB | 5-10% | ~15% faster |
| IQ3_S | ~3 GB | 15%+ | Significant degradation |
API latency comparison:
| Model | TTFT | Throughput | Context | Cost (per 1M tokens) |
|---|---|---|---|---|
| Gemini 1.5 Flash | <0.2s | High | 1M | $0.35 input |
| Claude 3 Haiku | Low | 165 tok/s | 200K | $0.25 input |
| GPT-4o Mini | Slowest | 80 tok/s | 128K | $0.15 input |
For local deployment targeting edge rearrangement, Qwen-2.5-7B with TensorRT-LLM quantized to Q4_K_M offers the optimal configuration: ~90 tok/s, ~4.5 GB VRAM, and strong structured output compliance on JSONSchemaBench.
The speed-intelligence Pareto frontier for KG operations
Mapping models across latency (x-axis) and task performance (y-axis) reveals distinct Pareto-optimal choices by operation type:
Relation Extraction Frontier:
- Optimal edge: Mistral-7B fine-tuned (95.8% TACRED, 93 tok/s)
- Balanced: Flan T5-XL + RAG (86.6% TACRED, moderate latency)
- Best accuracy: CTL-DRP specialized model (91.6% Re-TACRED, slower)
Link Prediction Frontier:
- Optimal edge: KG-LLaMA-7B (70.3% YAGO Hits@1, ~45 tok/s)
- Inductive capability: GPT-4 + ontology (15.2% ILPC, API latency)
- Maximum scale: RelEns on ogbl-wikikg2 (0.739 MRR, batch processing)
Multi-hop Reasoning Frontier:
- Optimal edge: GNN-RAG + 7B LLM (matches GPT-4 + ToG, 10x faster)
- Balanced: KnowledgeNavigator + LLaMA-2-70B (99.5% MetaQA 2-hop)
- Zero-shot: GPT-4o (70-80% HotpotQA, highest generalization)
Structured Output Frontier:
- Fastest: Claude 3 Haiku (165 tok/s, good compliance)
- Best compliance: GPT-4o Mini (82% MMLU, best structured output)
- Best local: Mistral-7B + constrained decoding (2x faster than GPT-3.5)
Recommended testing methodology for continuous edge rearrangement
Given the benchmark gap for dynamic ontology tasks, construct a composite evaluation protocol combining existing benchmarks with custom schema evolution tests:
Phase 1: Baseline capability assessment
- Relation extraction: Evaluate on Re-DocRED (document-level) and Re-TACRED (sentence-level)
- Link prediction: Test on FB15k-237 (transductive) and ILPC-Small (inductive)
- Structured output: JSONSchemaBench medium-hard tiers
- Multi-hop: GrailQA zero-shot split for compositional generalization
Phase 2: Dynamic schema stress testing (custom benchmarks required)
- Schema addition test: Add new relation types mid-inference, measure adaptation speed
- Schema removal test: Deprecate relations, verify graceful degradation
- Version drift test: Mix queries against schema v1 and v2, measure consistency
- Incremental update test: Stream ontology updates, measure latency impact
Phase 3: Efficiency profiling
- Measure tokens/second at Q4_K_M quantization
- Profile VRAM under continuous inference (memory leaks, KV cache growth)
- Batch throughput curves (1, 8, 32, 64 concurrent requests)
- Time-to-first-token under load
Recommended model evaluation order:
- Qwen-2.5-7B (Q4_K_M) — best efficiency + strong reasoning
- Mistral-7B (Q4_K_M) — best structured output compliance
- Phi-3.5-mini — smallest footprint with competitive accuracy
- GNN-RAG + 7B backbone — if multi-hop dominates workload
Gaps and emerging opportunities
Three critical gaps demand attention for knowledge graph practitioners:
Gap 1: No dynamic ontology mutation benchmarks exist. The literature acknowledges that “ontologies naturally co-evolve with their communities of use,” yet evaluation focuses exclusively on static schemas. Building a temporal ontology benchmark tracking schema versions over time would fill this need.
Gap 2: Inductive link prediction benchmarks remain immature. ILPC 2022 provides a starting point (baseline MRR ~0.14), but comprehensive evaluation of LLM inductive capabilities—essential for handling new entities in production graphs—requires expansion.
Gap 3: Continuous learning for KGs lacks standardization. The BeGin framework offers modular continual graph learning evaluation, but no established benchmark dataset for continual KG embedding exists—datasets are currently “sampled heuristically.”
Emerging benchmarks to monitor:
- OntoURL (May 2025) — most comprehensive ontology understanding evaluation
- CoDEx-Mul (2024) — multimodal KG completion with images and text
- Dynamic-KGQA — continuously updated QA sets addressing data contamination
- JSONSchemaBench — integrated with lm-evaluation-harness for structured output
Conclusion
The benchmark landscape for LLM knowledge graph operations has matured significantly for static tasks while leaving dynamic schema evolution critically under-evaluated. For production systems requiring continuous edge rearrangement, the evidence points toward fine-tuned 7B models (particularly KG-LLaMA, Mistral, and Qwen-2.5) operating at Q4_K_M quantization as the efficiency-optimal choice—achieving 85-95% of frontier model accuracy at 10-20x lower inference cost.
The most predictive benchmarks for real-world graph manipulation are: Re-DocRED (document-level extraction quality), ILPC (inductive generalization), GrailQA zero-shot (compositional reasoning), and JSONSchemaBench (structured output reliability). Custom schema evolution tests remain necessary until the community develops standardized dynamic ontology benchmarks.
GNN-LLM hybrid architectures (particularly GNN-RAG) represent the breakthrough approach for multi-hop reasoning, matching GPT-4 performance with 7B parameter models. For continuous knowledge graph updates, combining these architectures with efficient quantization offers the clearest path to production deployment.
Related
- dynamic-knowledge-graphs — Production architectures for living knowledge bases
- vibe-coding-infrastructure — Implementation patterns for AI-native development