Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
Summary: Evaporate is an LLM-based system that converts heterogeneous documents into queryable tables using in-context learning rather than domain-specific training. Its Evaporate-Code+ variant ensembles many synthesized extractors with weak supervision, approaching or exceeding the quality of direct LLM extraction while making only a sublinear LLM pass over the documents (≈110× fewer document calls).
(summarized by gpt-5-mini on Feb 09 2026)
- Paper ID: 13765
- Venue: VLDB
- Year: 2024
- Pagerank: 0.00013890154
- Overall Rank: 1,116 | 92.24%
- DOI: 10.14778/3626292.3626294
Incoming Citations (Sorted by Pagerank)
Showing 27 of 27 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 1,963 | DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing | 2025 | VLDB | 9.929429e-05 |
| 2,106 | Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing | 2025 | CIDR | 9.5342543e-05 |
| 3,015 | Chorus: Foundation Models for Unified Data Discovery and Exploration | 2024 | VLDB | 7.7092391e-05 |
| 3,876 | The Design of an LLM-powered Unstructured Analytics System | 2025 | CIDR | 6.6741456e-05 |
| 5,509 | Can Large Language Models Predict Data Correlations from Column Names? | 2023 | VLDB | 5.4703368e-05 |
| 5,658 | Databases Unbound: Querying All of the World's Bytes with AI | 2024 | VLDB | 5.385675e-05 |
| 5,840 | Logical and Physical Optimizations for SQL Query Execution over Large Language Models | 2025 | SIGMOD | 5.3042561e-05 |
| 7,026 | Mind the Data Gap: Bridging LLMs to Enterprise Data Integration | 2025 | CIDR | 4.8570811e-05 |
| 7,705 | AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries | 2025 | CIDR | 4.6730494e-05 |
| 8,186 | E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model | 2025 | VLDB | 4.5651684e-05 |
| 8,204 | ELEET: Efficient Learned Query Execution over Text and Tables | 2024 | VLDB | 4.5594273e-05 |
| 8,469 | Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS | 2025 | VLDB | 4.5041113e-05 |
| 8,488 | Can Large Language Models Be Query Optimizer for Relational Databases? | 2026 | SIGMOD | 4.4998609e-05 |
| 8,520 | mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs | 2025 | VLDB | 4.4937074e-05 |
| 9,152 | Doctopus: Budget-aware Structural Table Extraction from Unstructured Documents | 2025 | VLDB | 4.3849295e-05 |
| 9,972 | KathDB: Explainable Multimodal Database Management System with Human-AI Collaboration | 2026 | CIDR | 4.1945683e-05 |
| 10,064 | Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees | 2026 | SIGMOD | 4.1945683e-05 |
| 10,115 | ST-Raptor: LLM-Powered Semi-Structured Table Question Answering | 2026 | SIGMOD | 4.1945683e-05 |
| 10,126 | Visual Template Inference for Data Extraction from Documents | 2026 | SIGMOD | 4.1945683e-05 |
| 10,215 | Task Cascades for Efficient Unstructured Data Processing | 2026 | SIGMOD | 4.1945683e-05 |
| 10,438 | Doctopus: A System for Budget-aware Structural Data Extraction from Unstructured Documents | 2025 | SIGMOD | 4.1945683e-05 |
| 10,455 | Sentence to Model: Cost-Effective Data Collection LLM Agent | 2025 | SIGMOD | 4.1945683e-05 |
| 10,456 | SwellDB: Dynamic Query-Driven Table Generation with Large Language Models | 2025 | SIGMOD | 4.1945683e-05 |
| 10,595 | Optimized Batch Prompting for Cost-effective LLMs | 2025 | VLDB | 4.1945683e-05 |
| 10,713 | CoLA: Model Collaboration for Log-based Anomaly Detection | 2025 | VLDB | 4.1945683e-05 |
| 10,752 | QUEST: Query Optimization in Unstructured Document Analysis | 2025 | VLDB | 4.1945683e-05 |
| 11,068 | Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities | 2024 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 11 of 11 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 10,973 | Unstructured Data Fusion for Schema and Data Extraction | 2024 | SIGMOD | 4.1945683e-05 |
| 8,155 | Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study | 2024 | SIGMOD | 4.5745248e-05 |
| 10,595 | Optimized Batch Prompting for Cost-effective LLMs | 2025 | VLDB | 4.1945683e-05 |
| 13,138 | Database Perspective on LLM Inference Systems | 2025 | VLDB | - |
| 8,736 | Unveiling Challenges for LLMs in Enterprise Data Engineering | 2026 | VLDB | 4.456315e-05 |
| 7,705 | AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries | 2025 | CIDR | 4.6730494e-05 |
| 3,995 | How Large Language Models Will Disrupt Data Management | 2023 | VLDB | 6.5513237e-05 |
| 10,797 | A Demonstration of QueryArtisan: Real-Time Data Lake Analysis via Dynamically Generated Data Manipulation Code | 2025 | VLDB | 4.1945683e-05 |
| 9,961 | QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes | 2025 | VLDB | 4.2294678e-05 |
| 7,020 | LLM for Data Management | 2024 | VLDB | 4.8595728e-05 |