Database Paper Browser

The Data Civilizer System

Summary: DATA CIVILIZER builds a linkage graph to discover relevant enterprise datasets and join paths, federates execution across heterogeneous stores via a polystore, and integrates cleaning into query processing. It also adds a workflow engine for flexible, update-aware pipelines. (summarized by gpt-5-mini on Feb 09 2026)
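
As an illustration of the join-path idea in the summary, here is a minimal sketch, assuming a linkage graph can be modeled as tables (nodes) connected by joinable column pairs (edges). It is not the paper's algorithm; the graph contents, table names, and the `find_join_path` helper are hypothetical, and plain breadth-first search stands in for whatever discovery method the system actually uses.

```python
from collections import deque

# Hypothetical linkage graph: each table maps to a list of
# (neighbor table, (left column, right column)) entries that are
# assumed to be joinable. The contents are invented for illustration.
LINKAGE_GRAPH = {
    "customers": [("orders", ("customers.id", "orders.customer_id"))],
    "orders": [
        ("customers", ("orders.customer_id", "customers.id")),
        ("shipments", ("orders.id", "shipments.order_id")),
    ],
    "shipments": [("orders", ("shipments.order_id", "orders.id"))],
    "suppliers": [],
}


def find_join_path(graph, source, target):
    """Return the shortest list of join column pairs linking source to
    target, an empty list if they are the same table, or None if no
    path exists."""
    if source == target:
        return []
    visited = {source}
    queue = deque([(source, [])])
    while queue:
        table, path = queue.popleft()
        for neighbor, columns in graph.get(table, []):
            if neighbor in visited:
                continue
            new_path = path + [columns]
            if neighbor == target:
                return new_path
            visited.add(neighbor)
            queue.append((neighbor, new_path))
    return None


if __name__ == "__main__":
    print(find_join_path(LINKAGE_GRAPH, "customers", "shipments"))
```

Breadth-first search returns a path with the fewest joins; a real system would likely also weigh edge quality (for example, how reliable each candidate join is), which this sketch ignores.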

Paper ID: 303
Venue: CIDR
Year: 2017
Pagerank: 0.00012879695
Overall Rank: 1,277 | 91.12%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 50 of 54 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344
1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639
1,541 | Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes | 2023 | CIDR | 0.00011456579
1,644 | Finding Related Tables in Data Lakes for Interactive Data Science | 2020 | SIGMOD | 0.00011041787
1,831 | Synthesizing Entity Matching Rules by Examples | 2018 | VLDB | 0.00010384082
1,894 | Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning | 2020 | VLDB | 0.0001018378
2,209 | Data Integration: After the Teenage Years | 2017 | PODS | 9.2868035e-05
2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05
2,359 | Data Market Platforms: Trading Data Assets to Solve Data Problems | 2020 | VLDB | 8.9607667e-05
2,517 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | 8.6092139e-05
2,730 | Open Data Integration | 2018 | VLDB | 8.2126735e-05
2,968 | Raha: A Configuration-Free Error Detection System | 2019 | SIGMOD | 7.7985097e-05
3,252 | Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks | 2020 | SIGMOD | 7.3178277e-05
3,265 | RHEEM: Enabling Cross-Platform Data Processing - May The Big Data Be With You! - | 2018 | VLDB | 7.3083672e-05
3,358 | Organizing Data Lakes for Navigation | 2020 | SIGMOD | 7.1784949e-05
3,467 | Data Profiling – A Tutorial | 2017 | SIGMOD | 7.069081e-05
3,824 | Correlation Sketches for Approximate Join-Correlation Queries | 2021 | SIGMOD | 6.7260705e-05
4,212 | Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration | 2023 | SIGMOD | 6.3555142e-05
4,595 | Juneau: Data Lake Management for Jupyter | 2019 | VLDB | 6.060188e-05
5,058 | A Demo of the Data Civilizer System | 2017 | SIGMOD | 5.7280139e-05
5,153 | Horizon: Scalable Dependency-driven Data Cleaning | 2021 | VLDB | 5.6607963e-05
5,179 | SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints | 2017 | VLDB | 5.6428428e-05
5,383 | Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search | 2021 | VLDB | 5.5393038e-05
5,794 | Discovering Related Data At Scale | 2021 | VLDB | 5.3245122e-05
6,187 | Semi-Supervised Data Cleaning with Raha and Baran | 2021 | CIDR | 5.1656857e-05
6,280 | Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks | 2023 | VLDB | 5.1290457e-05
6,360 | High-Dimensional Vector Similarity Search: From Time Series to Deep Network Embeddings | 2020 | SIGMOD | 5.0961051e-05
7,303 | DICE: Data Discovery by Example | 2021 | VLDB | 4.7684686e-05
7,311 | The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development | 2020 | SIGMOD | 4.7656884e-05
7,384 | The VADA Architecture for Cost-Effective Data Wrangling | 2017 | SIGMOD | 4.7445719e-05
7,411 | ItemSuggest: A Data Management Platform for Machine Learned Ranking Services | 2019 | CIDR | 4.7364436e-05
7,643 | Cross Modal Data Discovery over Structured and Unstructured Data Lakes | 2023 | VLDB | 4.6901105e-05
7,704 | ExDRa: Exploratory Data Science on Federated Raw Data | 2021 | SIGMOD | 4.6733838e-05
7,745 | Crossing the finish line faster when paddling the Data Lake with KAYAK | 2017 | VLDB | 4.6618625e-05
7,858 | ConnectionLens: Finding Connections Across Heterogeneous Data Sources | 2018 | VLDB | 4.6342491e-05
8,000 | Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics | 2019 | VLDB | 4.6092803e-05
8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05
8,116 | LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes | 2024 | VLDB | 4.581507e-05
8,696 | Effective Entity Augmentation By Querying External Data Sources | 2023 | VLDB | 4.4660032e-05
8,729 | OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance From Database Query Event Logs | 2023 | VLDB | 4.4582221e-05
8,974 | DataLoom: Simplifying Data Loading with LLMs | 2024 | VLDB | 4.4184286e-05
9,253 | Glean: Structured Extractions from Templatic Documents | 2021 | VLDB | 4.3690661e-05
9,306 | Debugging Large-Scale Data Science Pipelines using Dagger | 2020 | VLDB | 4.3572942e-05
9,379 | GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example | 2023 | SIGMOD | 4.3462787e-05
9,412 | Retrofitting GDPR Compliance onto Legacy Databases | 2022 | VLDB | 4.3441378e-05
9,961 | QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes | 2025 | VLDB | 4.2294678e-05
10,291 | Morphing-based Compression for Data-centric ML Pipelines | 2026 | VLDB | 4.1945683e-05
10,610 | Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation | 2025 | VLDB | 4.1945683e-05
10,828 | Buckaroo: A Direct Manipulation Visual Data Wrangler | 2025 | VLDB | 4.1945683e-05
11,063 | Searching Data Lakes for Nested and Joined Data | 2024 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 13 of 13 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
