QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes
Summary: QueryArtisan is an LLM-driven system that generates just-in-time data-manipulation code, enabling natural-language ad-hoc queries directly over heterogeneous, schema-less data lakes through modality-aware operators. It integrates a cost-model optimizer to produce efficient operator plans, avoiding the need for ETL or schemas and outperforming prior LLM and ETL approaches.
(summarized by gpt-5-mini on Feb 09 2026)
- Paper ID: 13780
- Venue: VLDB
- Year: 2025
- Pagerank: 4.2294678e-05
- Overall Rank: 9,961 | 30.71%
- DOI: 10.14778/3705829.3705832
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 2 of 2 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 17 of 17 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 369 | Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation | 2024 | VLDB | 0.0002547515 |
| 513 | TURL: Table Understanding through Representation Learning | 2021 | VLDB | 0.00021288342 |
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344 |
| 1,178 | Table Union Search on Open Data | 2018 | VLDB | 0.00013468118 |
| 1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639 |
| 1,277 | The Data Civilizer System | 2017 | CIDR | 0.00012879695 |
| 1,643 | CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions using GPT-3 Codex | 2022 | VLDB | 0.0001104256 |
| 1,664 | On Multi-Column Foreign Key Discovery | 2010 | VLDB | 0.00010976887 |
| 2,836 | Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning | 2023 | VLDB | 8.0443826e-05 |
| 3,281 | Constance: An Intelligent Data Lake System | 2016 | SIGMOD | 7.2823287e-05 |
| 3,908 | Progressive and Selective Merge: Computing Top-K with Ad-hoc Ranking Functions | 2007 | SIGMOD | 6.6392878e-05 |
| 3,942 | Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins | 2022 | VLDB | 6.6114622e-05 |
| 4,859 | Integrating Data Lake Tables | 2023 | VLDB | 5.8732433e-05 |
| 4,958 | Efficient Subgraph Search over Large Uncertain Graphs | 2011 | VLDB | 5.8031038e-05 |
| 6,165 | When the Web is your Data Lake: Creating a Search Engine for Datasets on the Web | 2020 | SIGMOD | 5.1728052e-05 |
| 7,643 | Cross Modal Data Discovery over Structured and Unstructured Data Lakes | 2023 | VLDB | 4.6901105e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 9,993 | Leveraging Query Optimizers to Verify the Soundness of LLM-based Query Rewrites for Real-World Workloads, and More! | 2026 | CIDR | 4.1945683e-05 |
| 7,020 | LLM for Data Management | 2024 | VLDB | 4.8595728e-05 |
| 10,897 | Welding Natural Language Queries to Analytics IRs with LLMs | 2024 | CIDR | 4.1945683e-05 |
| 8,736 | Unveiling Challenges for LLMs in Enterprise Data Engineering | 2026 | VLDB | 4.456315e-05 |
| 8,974 | DataLoom: Simplifying Data Loading with LLMs | 2024 | VLDB | 4.4184286e-05 |
| 5,840 | Logical and Physical Optimizations for SQL Query Execution over Large Language Models | 2025 | SIGMOD | 5.3042561e-05 |
| 8,488 | Can Large Language Models Be Query Optimizer for Relational Databases? | 2026 | SIGMOD | 4.4998609e-05 |
| 7,705 | AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries | 2025 | CIDR | 4.6730494e-05 |
| 1,116 | Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes | 2024 | VLDB | 0.00013890154 |
| 10,797 | A Demonstration of QueryArtisan: Real-Time Data Lake Analysis via Dynamically Generated Data Manipulation Code | 2025 | VLDB | 4.1945683e-05 |