Benchmark of Benchmarks for LLM Knowledge Graph Operations

Fine-tuned 7B models now outperform zero-shot GPT-4 on knowledge graph completion, while a critical gap exists in benchmarks for dynamic ontology mutation—the very task most needed for continuous edge rearrangement in production systems. This comprehensive survey maps 47 benchmarks across relation extraction, link prediction, schema evolution, and multi-hop reasoning, revealing that optimal model selection depends heavily on the specific graph operation required.

The landscape divides into mature benchmark ecosystems (relation extraction, link prediction) and severely under-benchmarked areas (dynamic schema evolution). For practitioners building knowledge graph systems requiring low latency and high efficiency, Qwen-2.5-7B and Mistral-7B at Q4_K_M quantization offer the best Pareto frontier positions, achieving 85%+ of frontier model accuracy at ~4.5 GB VRAM. Throughput for 7B-class models ranges from ~45 tokens/second under llama.cpp to ~90 tokens/second under TensorRT-LLM or vLLM on an A100, as detailed in the efficiency section.

Relation extraction benchmarks reveal annotation quality crisis

Document-level and sentence-level relation extraction form the foundation of graph population from unstructured text. The benchmark landscape has matured significantly, but annotation quality issues have forced the community toward revised datasets.

DocRED remains the primary document-level benchmark with 5,053 Wikipedia documents across 96 Wikidata relations, requiring reading multiple sentences to identify relationships. However, its substantial false negative rate led to Re-DocRED, which shows models gaining ~13 F1 points when trained on corrected annotations. Current state-of-the-art (DREEAM) achieves ~82 F1 on Re-DocRED, while direct LLM fine-tuning produces suboptimal results—hybrid approaches combining LLaMA-2 with relation classifiers (LMRC) significantly outperform pure LLM methods.

TACRED (106,264 sentence-level examples, 42 relation types) suffers from ~8% label errors affecting absolute F1 scores. Its revised versions—TACREV and Re-TACRED—now serve as preferred evaluation targets. On Re-TACRED, specialized models reach 91.6 F1, while RAG-enhanced approaches using Flan T5-XL achieve 86.6 F1 on original TACRED.

Benchmark	Scale	Relations	Best Model	Score (F1 unless noted)
Re-DocRED	5K docs	96	DREEAM	~82
Re-TACRED	106K sent	Redesigned	CTL-DRP	91.6
TACREV	106K sent	42	RAG4RE (Flan T5-XL)	88.3
FewRel (5-way 5-shot)	70K sent	100	BERT-PAIR	~98% acc
SemEval-2010 Task 8	10.7K sent	10	Entity-Centric DT	90.5

Small model performance varies dramatically by approach. Mistral-7B with task-incremental fine-tuning achieves 95.8% accuracy on seen TACRED tasks, while Flan T5 with retrieval augmentation (RAG4RE) delivers the best LLM-based results. Zero-shot LLMs still trail specialized architectures by 10-20 F1 points on document-level extraction, making fine-tuning essential for production deployment.
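As a reference point for the F1 figures above, here is a minimal sketch of micro-averaged F1 in the TACRED style, where the negative class `no_relation` is excluded from the counts. The function name and the toy labels are illustrative assumptions, not any benchmark's official scorer.

```python
from typing import Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str],
             negative: str = "no_relation") -> float:
    """Micro-averaged F1 over positive relation labels only."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # A prediction counts as correct only if it matches a positive gold label.
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    predicted = sum(1 for p in pred if p != negative)
    actual = sum(1 for g in gold if g != negative)
    if correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / actual
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): one spurious positive, one missed positive.
gold = ["per:title", "no_relation", "org:founded_by", "per:title"]
pred = ["per:title", "per:title", "no_relation", "per:title"]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.3f}")
```

The second example shows why annotation quality matters: if the gold `no_relation` label is itself a missed annotation, a correct prediction is scored as a precision error.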

Link prediction benchmarks show LLMs catching up to embeddings

Knowledge graph completion benchmarks have evolved from simple Freebase subsets to million-scale Wikidata evaluations. The key insight: fine-tuned LLaMA-7B now outperforms ChatGPT and matches GPT-4 on triple classification, fundamentally changing the cost-performance calculus.

FB15k-237 (14,541 entities, 237 relations) and WN18RR (40,943 entities, 11 relations) remain standard transductive benchmarks. Recent advances like KG-FIT (NeurIPS 2024) and SR-GNN achieve a +10% MRR improvement over baselines by combining LLM-refined hierarchies with graph embeddings.

For large-scale evaluation, ogbl-wikikg2 (2.5 million entities, 17 million edges) tests scalability: the current SOTA (RelEns) achieves 0.739 MRR, while traditional embeddings like RotatE plateau at 0.433 MRR. This roughly 70% relative improvement from ensemble methods underscores the importance of hybrid approaches.

The Inductive Link Prediction Challenge (ILPC 2022) tests zero-shot generalization to entirely new entities, which is critical for dynamic knowledge graphs. Baseline performance remains modest (MRR ~0.14), while GPT-4 with ontology augmentation reaches 0.152 Hits@1. The two figures use different metrics and are not directly comparable (for the same ranking, MRR is always at least as large as Hits@1), but the result suggests LLMs offer inductive capabilities that traditional embeddings lack.
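Since this section quotes both MRR and Hits@k, a short sketch of how each is computed from the rank of the correct entity may help. The rank list is a made-up example; the two MRR values in the last lines are the ogbl-wikikg2 figures quoted above.

```python
from typing import Sequence


def mrr(ranks: Sequence[int]) -> float:
    """Mean reciprocal rank; ranks are 1-based positions of the true entity."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose true entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct tail entity for four test triples.
ranks = [1, 2, 4, 10]
print(f"MRR     = {mrr(ranks):.4f}")
print(f"Hits@1  = {hits_at_k(ranks, 1):.2f}")
print(f"Hits@3  = {hits_at_k(ranks, 3):.2f}")

# Relative improvement between the two MRR values quoted for ogbl-wikikg2.
relative_gain = (0.739 - 0.433) / 0.433
print(f"relative MRR gain = {relative_gain:.1%}")
```

Because every query contributes at least its Hits@1 value to MRR, an MRR and a Hits@1 score should not be compared as if they were on the same scale.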

KG-LLM Framework results demonstrate the fine-tuning opportunity:

Model	Triple Classification (WN11/FB13)	Relation Prediction (YAGO3-10)
KG-LLaMA2-13B (fine-tuned)	96.6% / 90.7%	70.28% Hits@1
GPT-4 (zero-shot)	~94% (100-sample)	56% Hits@1
ChatGPT (zero-shot)	~90% (100-sample)	39% Hits@1
LLaMA-7B (zero-shot)	21.1% / 9.1%	—

The jump from zero-shot LLaMA-7B (21.1% / 9.1%) to the fine-tuned KG-LLaMA model (96.6% / 90.7%) is roughly 4.6x on WN11 and 10x on FB13. Combined with dramatically lower inference costs, this makes specialized fine-tuning the clear winner for production link prediction systems.

Dynamic ontology benchmarks face a critical gap

No dedicated benchmarks exist for testing real-time schema evolution handling—the most significant finding of this survey. While robust evaluation exists for static ontology tasks, the benchmark ecosystem fails to address how models adapt when ontologies change during operation.

Existing temporal knowledge graph benchmarks (ICEWS, GDELT, YAGO temporal variants) test temporal facts but not temporal changes to the schema itself. Models achieve ~45% MRR on ICEWS cross-dataset evaluation and ~63% MRR on YAGO temporal subsets, but these measure temporal reasoning over static schemas.

Structured output benchmarks partially fill the gap. JSONSchemaBench (January 2025) evaluates constrained decoding across ~10,000 real-world JSON schemas in 10 categories. Best frameworks support 2x more schemas than worst performers, and constrained decoding improves task accuracy by up to 4%. However, this tests schema adherence, not schema evolution.
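To make "schema adherence" concrete, the sketch below checks whether a model's raw output parses as JSON and satisfies a small edge schema. It is a deliberately tiny subset of JSON Schema (only `type`, `required`, and `properties`) written with the standard library; it is not the JSONSchemaBench harness, and `EDGE_SCHEMA` is a hypothetical schema for a single graph edge.

```python
import json
from typing import Any

_TYPES = {"string": str, "number": (int, float), "integer": int,
          "boolean": bool, "array": list, "object": dict}


def adheres(instance: Any, schema: dict) -> bool:
    """Check `type`, `required`, and `properties` only (simplified subset)."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, but JSON treats them as distinct.
        if isinstance(instance, bool) and expected in ("number", "integer"):
            return False
        if not isinstance(instance, _TYPES[expected]):
            return False
    if isinstance(instance, dict):
        if any(key not in instance for key in schema.get("required", [])):
            return False
        for key, sub_schema in schema.get("properties", {}).items():
            if key in instance and not adheres(instance[key], sub_schema):
                return False
    return True


# Hypothetical schema for one knowledge graph edge.
EDGE_SCHEMA = {
    "type": "object",
    "required": ["head", "relation", "tail"],
    "properties": {
        "head": {"type": "string"},
        "relation": {"type": "string"},
        "tail": {"type": "string"},
        "confidence": {"type": "number"},
    },
}


def parse_and_check(raw: str) -> bool:
    """Return True only if `raw` is valid JSON that fits EDGE_SCHEMA."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return adheres(obj, EDGE_SCHEMA)


good = '{"head": "Marie Curie", "relation": "award_received", "tail": "Nobel Prize in Physics", "confidence": 0.93}'
print(parse_and_check(good))
print(parse_and_check('{"head": "Marie Curie", "relation": "award_received"}'))
```

A check like this scores outputs against one fixed schema. Testing schema evolution would additionally require swapping `EDGE_SCHEMA` between requests and measuring how the model copes.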

OAEI (Ontology Alignment Evaluation Initiative) provides the closest analog to schema evolution testing through ontology matching tracks:

Anatomy track: aligns the Adult Mouse Anatomy (2,744 classes) with the NCI Thesaurus anatomy subset (3,304 classes)
Bio-ML track: Machine learning-friendly biomedical matching
Complex Correspondences: Beyond simple 1:1 mappings

The newest benchmark, OntoURL (May 2025), evaluates LLM ontological understanding across 58,981 questions from 40 ontologies. Key finding: LLMs demonstrate strong understanding but struggle with reasoning and learning tasks—precisely the capabilities needed for dynamic ontology mutation.

Missing benchmark capabilities for edge rearrangement:

Incremental ontology update handling (adding/removing/modifying concepts)
Schema drift detection and awareness
Backward compatibility with old data under new schemas
Cross-version reasoning across multiple ontology states
Real-time schema mutation during inference
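The first two capabilities in the list above (incremental updates and drift detection) can be illustrated with a schema diff. This is a minimal sketch under simplifying assumptions: a schema is just a mapping from relation name to a hypothetical `RelationType` record, and all relation names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationType:
    """Hypothetical schema entry: a relation with typed endpoints."""
    name: str
    head_type: str
    tail_type: str


Schema = dict[str, RelationType]


def diff_schemas(old: Schema, new: Schema) -> dict[str, list[str]]:
    """Classify relation names as added, removed, or modified between versions."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


def stale_triples(triples: list[tuple[str, str, str]], new: Schema):
    """Stored triples whose relation no longer exists in the new schema."""
    return [t for t in triples if t[1] not in new]


V1 = {r.name: r for r in [
    RelationType("works_at", "Person", "Organization"),
    RelationType("born_in", "Person", "City"),
    RelationType("spouse", "Person", "Person"),
]}
V2 = {r.name: r for r in [
    RelationType("works_at", "Person", "Organization"),
    RelationType("born_in", "Person", "Place"),        # tail type widened
    RelationType("member_of", "Person", "Organization"),  # new relation
]}

changes = diff_schemas(V1, V2)
print(changes)
print(stale_triples([("ada", "spouse", "bob"), ("ada", "works_at", "acme")], V2))
```

A benchmark for dynamic ontologies would need a ground-truth change log of this kind for each schema version, so that a model's behavior before and after each change can be scored.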

Multi-hop reasoning separates production-ready from prototype models

Graph traversal benchmarks reveal that complexity scaling remains the primary challenge—performance drops from ~70% F1 at 2 hops to ~15% at 4+ hops, even for frontier models.

HotpotQA (112,779 Wikipedia Q&A pairs requiring 2+ hop reasoning) shows GPT-4o achieving ~70-80% F1 on bridge questions but degrading to ~15-25% F1 on 4-hop compositional questions. The human-machine gap remains substantial on complex reasoning.

MetaQA (407,000 questions across 1-3 hop subsets in the movie domain) demonstrates near-solved status: KnowledgeNavigator + LLaMA-2-70B achieves 99.5% on 2-hop tasks. However, this domain-constrained success doesn’t generalize.

GrailQA (64,331 Freebase questions with zero-shot generalization testing) provides the most rigorous evaluation: SOTA achieves 84.4% F1 overall, but zero-shot performance drops to 80.8% F1, a gap of 3.6 points. That gap between the overall score and the zero-shot split indicates how much a model relies on schema items it saw during training rather than on compositional reasoning.

Benchmark	Size	Hops	SOTA Performance	GPT-4 Range
HotpotQA	113K	2+	80%+ (specialized)	70-80% F1
MetaQA 2-hop	407K	2	99.5%	~99%
MetaQA 3-hop	407K	3	95%+	~95%
GrailQA	64K	1-4	84.4% F1	Not benchmarked
KQA Pro	120K	Multi	95.3% (vs 97.5% human)	—
CLUTRR	Variable	2-10+	GAT >> BERT	—

GNN-RAG hybrid approaches represent a breakthrough: combining GNN retrieval with a 7B LLM achieves +14.5% Hits improvement over ToG+ChatGPT while using far fewer resources. This architecture—GNN for candidate retrieval, LLM for final generation—offers the optimal efficiency profile for production multi-hop systems.
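The retrieve-then-generate split described above can be sketched as follows. Two substitutions are made to keep the example self-contained: a breadth-first path search stands in for the GNN retriever, and the LLM call is replaced by prompt construction only. The toy graph, function names, and prompt wording are all illustrative assumptions, not the GNN-RAG implementation.

```python
from collections import deque

Triple = tuple[str, str, str]


def find_paths(triples: list[Triple], start: str, max_hops: int) -> list[list[Triple]]:
    """Enumerate simple paths of 1..max_hops edges that begin at `start`."""
    out_edges: dict[str, list[Triple]] = {}
    for h, r, t in triples:
        out_edges.setdefault(h, []).append((h, r, t))
    paths: list[list[Triple]] = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if path:
            paths.append(path)
        if len(path) == max_hops:
            continue
        visited = {start} | {t for _, _, t in path}  # avoid revisiting nodes
        for edge in out_edges.get(node, []):
            if edge[2] not in visited:
                queue.append((edge[2], path + [edge]))
    return paths


def verbalize(path: list[Triple]) -> str:
    """Render a path as text, e.g. 'A -[r1]-> B -[r2]-> C'."""
    text = path[0][0]
    for _, r, t in path:
        text += f" -[{r}]-> {t}"
    return text


def build_prompt(question: str, paths: list[list[Triple]]) -> str:
    """Assemble the context that would be passed to the generator LLM."""
    lines = [verbalize(p) for p in paths]
    return ("Answer using only the reasoning paths below.\n"
            + "\n".join(lines)
            + f"\nQuestion: {question}\nAnswer:")


KG = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
    ("Inception", "starred_actors", "Leonardo DiCaprio"),
]
paths = find_paths(KG, "Inception", max_hops=2)
prompt = build_prompt("Where was the director of Inception born?", paths)
print(prompt)
```

The design point is that the retriever narrows the graph to a handful of candidate paths, so the language model only has to read and select rather than traverse.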

Efficiency metrics define the deployment frontier

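For single-stream token generation, memory bandwidth, not compute, determines inference speed. The H100 delivers 2.8x the throughput of the A100 at a 1.7x cost increase, while Q4_K_M quantization cuts weight memory roughly 3x relative to FP16 (about 14 GB for a 7B model at 2 bytes per parameter versus ~4.5 GB) with minimal quality degradation.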
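The bandwidth argument can be turned into a back-of-the-envelope bound: if every generated token requires reading all weights from memory once, tokens per second cannot exceed bandwidth divided by weight size. The sketch below assumes exactly that and ignores the KV cache, activations, and batching. The 1,500 GB/s bandwidth is an illustrative placeholder to replace with the real specification of the target GPU.

```python
def decode_upper_bound_tok_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model."""
    if weight_gb <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("sizes and bandwidth must be positive")
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 1500.0   # assumption: placeholder, not a measured value
FP16_7B_GB = 7e9 * 2 / 1e9  # 7B parameters at 2 bytes each = 14 GB
Q4_K_M_7B_GB = 4.5          # approximate footprint quoted in this section

bound_fp16 = decode_upper_bound_tok_s(FP16_7B_GB, BANDWIDTH_GB_S)
bound_q4 = decode_upper_bound_tok_s(Q4_K_M_7B_GB, BANDWIDTH_GB_S)
print(f"FP16   upper bound: {bound_fp16:.0f} tok/s")
print(f"Q4_K_M upper bound: {bound_q4:.0f} tok/s")
print(f"ratio: {bound_q4 / bound_fp16:.2f}x")
```

Real throughput lands below these bounds, and quantized kernels add dequantization overhead, which is one reason measured speedups from quantization are smaller than the memory ratio suggests.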

Generation speed benchmarks (tokens/second on A100):

Model	TensorRT-LLM	vLLM	llama.cpp
Mistral 7B	93.6	89.7	~45
LLaMA-2 7B	92.2	89.7	~45
Gemma 7B	65.9 (TGI)	—	~35
LLaMA-2 13B	52.6	49.2	~25

Quantization performance (GGUF on 7B models):

Quantization	VRAM	Quality Loss	Speed Impact
Q8_0	~8 GB	Minimal	Baseline
Q6_K	~6.5 GB	Negligible	~10% faster
Q4_K_M	~4.5 GB	<2%	~25% faster
Q3_K_M	~3.5 GB	5-10%	~15% faster
IQ3_S	~3 GB	15%+	Significant degradation

API latency comparison:

Model	TTFT	Throughput	Context	Cost (per 1M tokens)
Gemini 1.5 Flash	<0.2s	High	1M	$0.35 input
Claude 3 Haiku	Low	165 tok/s	200K	$0.25 input
GPT-4o Mini	Slowest	80 tok/s	128K	$0.15 input

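For local deployment targeting edge rearrangement, Qwen-2.5-7B is the recommended starting point, with strong structured output compliance on JSONSchemaBench. Two configurations are worth separating, because Q4_K_M is a GGUF format from the llama.cpp ecosystem, not a TensorRT-LLM quantization mode: llama.cpp at Q4_K_M (~4.5 GB VRAM, ~45 tok/s for 7B models in the tables above) when memory is the constraint, and TensorRT-LLM or vLLM (~90 tok/s for 7B models) when throughput is. Qwen-2.5-7B does not appear in the speed table, so these figures are 7B-class estimates to confirm by measurement.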

The speed-intelligence Pareto frontier for KG operations

Mapping models across latency (x-axis) and task performance (y-axis) reveals distinct Pareto-optimal choices by operation type:

Relation Extraction Frontier:

Optimal edge: Mistral-7B fine-tuned (95.8% accuracy on seen TACRED tasks, ~93 tok/s with TensorRT-LLM)
Balanced: Flan T5-XL + RAG (86.6 F1 on TACRED, moderate latency)
Best accuracy: CTL-DRP specialized model (91.6 F1 on Re-TACRED, slower)

Link Prediction Frontier:

Optimal edge: KG-LLaMA-7B (70.3% Hits@1 on YAGO3-10, ~45 tok/s)
Inductive capability: GPT-4 + ontology (15.2% Hits@1 on ILPC, API latency)
Maximum scale: RelEns on ogbl-wikikg2 (0.739 MRR, batch processing)

Multi-hop Reasoning Frontier:

Optimal edge: GNN-RAG + 7B LLM (matches GPT-4 + ToG, 10x faster)
Balanced: KnowledgeNavigator + LLaMA-2-70B (99.5% MetaQA 2-hop)
Zero-shot: GPT-4o (70-80% HotpotQA, highest generalization)

Structured Output Frontier:

Fastest: Claude 3 Haiku (165 tok/s, good compliance)
Best compliance: GPT-4o Mini (82% MMLU, best structured output)
Best local: Mistral-7B + constrained decoding (2x faster than GPT-3.5)
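For readers who want to build their own frontier from measured data, the sketch below selects the Pareto-optimal candidates from (throughput, score) pairs, where higher is better on both axes. The candidate names and numbers are hypothetical placeholders, not measurements from this survey.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    tok_per_s: float  # higher is better
    score: float      # task metric on ONE benchmark, higher is better


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep candidates that no other candidate beats on both axes."""
    front = []
    for c in cands:
        dominated = any(
            o.tok_per_s >= c.tok_per_s and o.score >= c.score
            and (o.tok_per_s > c.tok_per_s or o.score > c.score)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.tok_per_s)


# Hypothetical measurements for one task on one benchmark.
candidates = [
    Candidate("model_a", 90.0, 0.80),
    Candidate("model_b", 45.0, 0.85),
    Candidate("model_c", 40.0, 0.82),  # slower and less accurate than model_b
    Candidate("model_d", 25.0, 0.90),
    Candidate("model_e", 90.0, 0.70),  # same speed as model_a, lower score
]
front = pareto_front(candidates)
for c in front:
    print(f"{c.name}: {c.tok_per_s:.0f} tok/s, score {c.score:.2f}")
```

A frontier is only meaningful when every score comes from the same benchmark and metric, which is why the lists above are grouped by operation type.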

Recommended testing methodology for continuous edge rearrangement

Given the benchmark gap for dynamic ontology tasks, construct a composite evaluation protocol combining existing benchmarks with custom schema evolution tests:

Phase 1: Baseline capability assessment

Relation extraction: Evaluate on Re-DocRED (document-level) and Re-TACRED (sentence-level)
Link prediction: Test on FB15k-237 (transductive) and ILPC-Small (inductive)
Structured output: JSONSchemaBench medium-hard tiers
Multi-hop: GrailQA zero-shot split for compositional generalization

Phase 2: Dynamic schema stress testing (custom benchmarks required)

Schema addition test: Add new relation types mid-inference, measure adaptation speed
Schema removal test: Deprecate relations, verify graceful degradation
Version drift test: Mix queries against schema v1 and v2, measure consistency
Incremental update test: Stream ontology updates, measure latency impact
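The removal and version drift tests above could be scored along the following lines. This is a sketch under stated assumptions: a schema is a set of relation names, the extractor is any callable that takes a sentence and a schema, and the two keyword-based extractors are stand-ins for a real model call. All names and sentences are hypothetical.

```python
from typing import Callable

Schema = frozenset[str]
Extractor = Callable[[str, Schema], str]

RULES = [("works at", "employer"), ("married", "spouse"), ("born in", "birthplace")]


def schema_aware(sentence: str, schema: Schema) -> str:
    """Stand-in extractor that only emits relations present in the schema."""
    for phrase, relation in RULES:
        if phrase in sentence and relation in schema:
            return relation
    return "no_relation"


def schema_blind(sentence: str, schema: Schema) -> str:
    """Stand-in extractor that ignores the schema it was given."""
    for phrase, relation in RULES:
        if phrase in sentence:
            return relation
    return "no_relation"


def drift_report(extract: Extractor, sentences: list[str],
                 v1: Schema, v2: Schema) -> dict[str, float]:
    """Consistency on relations shared by both versions, plus leaked labels."""
    shared = v1 & v2
    comparable = consistent = leaked = 0
    for s in sentences:
        before, after = extract(s, v1), extract(s, v2)
        if before in shared:  # label is valid in both versions: it should not move
            comparable += 1
            consistent += int(before == after)
        if after != "no_relation" and after not in v2:
            leaked += 1       # deprecated or unknown label emitted under v2
    return {"consistency": consistent / comparable if comparable else 1.0,
            "leaked": leaked}


V1: Schema = frozenset({"employer", "spouse"})
V2: Schema = frozenset({"employer", "birthplace"})  # spouse removed, birthplace added
SENTENCES = ["Ada works at Acme.", "Ada married Bob.", "Ada was born in London."]

aware = drift_report(schema_aware, SENTENCES, V1, V2)
blind = drift_report(schema_blind, SENTENCES, V1, V2)
print("schema-aware:", aware)
print("schema-blind:", blind)
```

The `leaked` count corresponds to the graceful degradation check in the removal test, and `consistency` to the version drift test. Adaptation speed and latency impact would need timing on top of this.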

Phase 3: Efficiency profiling

Measure tokens/second at Q4_K_M quantization
Profile VRAM under continuous inference (memory leaks, KV cache growth)
Batch throughput curves (1, 8, 32, 64 concurrent requests)
Time-to-first-token under load
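The throughput and time-to-first-token measurements can be taken with a small wrapper around any streaming generator. In this sketch `fake_model` is a stand-in that sleeps to simulate prefill and decode. With a real backend, replace it with that backend's streaming iterator, which is an assumption about the serving setup.

```python
import time
from typing import Callable, Iterator


def profile_stream(generate: Callable[[str], Iterator[str]], prompt: str) -> dict[str, float]:
    """Measure time-to-first-token and decode throughput for one request."""
    start = time.perf_counter()
    first = None
    n_tokens = 0
    for _ in generate(prompt):
        if first is None:
            first = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("generator produced no tokens")
    decode_time = end - first
    # Throughput is computed over tokens after the first, excluding prefill.
    tok_per_s = (n_tokens - 1) / decode_time if n_tokens > 1 and decode_time > 0 else float("nan")
    return {"ttft_s": first - start, "tokens": n_tokens, "tok_per_s": tok_per_s}


def fake_model(prompt: str) -> Iterator[str]:
    """Stand-in generator: 50 ms of simulated prefill, then 5 ms per token."""
    time.sleep(0.05)
    for i in range(20):
        time.sleep(0.005)
        yield f"tok{i}"


stats = profile_stream(fake_model, "Extract triples from: Ada works at Acme.")
print(f"TTFT {stats['ttft_s'] * 1000:.0f} ms, "
      f"{stats['tokens']} tokens, {stats['tok_per_s']:.0f} tok/s")
```

Running the same wrapper from 1, 8, 32, and 64 concurrent workers gives the batch throughput curve. VRAM profiling requires tooling from the GPU vendor and is not covered here.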

Recommended model evaluation order:

Qwen-2.5-7B (Q4_K_M) — best efficiency + strong reasoning
Mistral-7B (Q4_K_M) — best structured output compliance
Phi-3.5-mini — smallest footprint with competitive accuracy
GNN-RAG + 7B backbone — if multi-hop dominates workload

Gaps and emerging opportunities

Three critical gaps demand attention for knowledge graph practitioners:

Gap 1: No dynamic ontology mutation benchmarks exist. The literature acknowledges that “ontologies naturally co-evolve with their communities of use,” yet evaluation focuses exclusively on static schemas. Building a temporal ontology benchmark tracking schema versions over time would fill this need.

Gap 2: Inductive link prediction benchmarks remain immature. ILPC 2022 provides a starting point (baseline MRR ~0.14), but comprehensive evaluation of LLM inductive capabilities—essential for handling new entities in production graphs—requires expansion.

Gap 3: Continuous learning for KGs lacks standardization. The BeGin framework offers modular continual graph learning evaluation, but no established benchmark dataset for continual KG embedding exists—datasets are currently “sampled heuristically.”

Emerging benchmarks to monitor:

OntoURL (May 2025) — most comprehensive ontology understanding evaluation
CoDEx-Mul (2024) — multimodal KG completion with images and text
Dynamic-KGQA — continuously updated QA sets addressing data contamination
JSONSchemaBench — integrated with lm-evaluation-harness for structured output

Conclusion

The benchmark landscape for LLM knowledge graph operations has matured significantly for static tasks while leaving dynamic schema evolution critically under-evaluated. For production systems requiring continuous edge rearrangement, the evidence points toward fine-tuned 7B models (particularly KG-LLaMA, Mistral, and Qwen-2.5) operating at Q4_K_M quantization as the efficiency-optimal choice—achieving 85-95% of frontier model accuracy at 10-20x lower inference cost.

The most predictive benchmarks for real-world graph manipulation are: Re-DocRED (document-level extraction quality), ILPC (inductive generalization), GrailQA zero-shot (compositional reasoning), and JSONSchemaBench (structured output reliability). Custom schema evolution tests remain necessary until the community develops standardized dynamic ontology benchmarks.

GNN-LLM hybrid architectures (particularly GNN-RAG) represent the breakthrough approach for multi-hop reasoning, matching GPT-4 performance with 7B parameter models. For continuous knowledge graph updates, combining these architectures with efficient quantization offers the clearest path to production deployment.

dynamic-knowledge-graphs — Production architectures for living knowledge bases
vibe-coding-infrastructure — Implementation patterns for AI-native development
