Table Union Search on Open Data

Summary: Proposes a probabilistic approach to table union search on Open Data that finds unionable tables through shared attribute domains. It introduces three unionability models (set-domain, ontology-based semantic, and natural-language domain), a data-driven selector that chooses among them for each attribute pair, and a distribution-aware union-size algorithm. It also provides an open benchmark and reports results that scale to more than 1M attributes. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 11787
Venue: VLDB
Year: 2018
Pagerank: 0.00013458551
Overall Rank: 1,179 | 91.81%
DOI: 10.14778/3192965.3192973

[Figure: Incoming Non-self Citations Over Time]

Incoming Citations (Sorted by Pagerank)

Showing 46 of 46 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
936	Data Lake Management: Challenges and Opportunities	2019	VLDB	0.00015197838
1,185	JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes	2019	SIGMOD	0.00013432692
1,643	Finding Related Tables in Data Lakes for Interactive Data Science	2020	SIGMOD	0.00011031534
1,742	Auctus: A Dataset Search Engine for Data Discovery and Augmentation	2021	VLDB	0.00010695388
2,737	Open Data Integration	2018	VLDB	8.2053894e-05
2,842	Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning	2023	VLDB	8.0366354e-05
3,001	SANTOS: Relationship-based Semantic Table Union Search	2023	SIGMOD	7.739698e-05
3,326	DeepJoin: Joinable Table Discovery with Pre-trained Language Models	2023	VLDB	7.2148323e-05
3,360	Organizing Data Lakes for Navigation	2020	SIGMOD	7.1719486e-05
3,827	Correlation Sketches for Approximate Join-Correlation Queries	2021	SIGMOD	6.7195959e-05
3,968	Pytheas: Pattern-based Table Discovery in CSV Files	2020	VLDB	6.5777452e-05
4,589	Juneau: Data Lake Management for Jupyter	2019	VLDB	6.0565861e-05
4,860	Integrating Data Lake Tables	2023	VLDB	5.867964e-05
5,022	Towards Distribution-aware Query Answering in Data Markets	2022	VLDB	5.7479778e-05
5,386	Selective Data Acquisition in the Wild for Model Charging	2022	VLDB	5.5346315e-05
5,538	Data-Driven Domain Discovery for Structured Datasets	2020	VLDB	5.4520759e-05
5,965	Automatic Data Acquisition for Deep Learning	2021	VLDB	5.2476363e-05
5,982	Responsible Data Integration: Next-generation Challenges	2022	SIGMOD	5.2409386e-05
6,071	Observatory: Characterizing Embeddings of Relational Tables	2024	VLDB	5.2231739e-05
6,230	Mosaic: A Sample-Based Database System for Open World Query Processing	2020	CIDR	5.1402482e-05
6,268	MATE: Multi-Attribute Table Extraction	2022	VLDB	5.1288179e-05
6,432	RONIN: Data Lake Exploration	2021	VLDB	5.0571585e-05
6,445	Causal Data Integration	2023	VLDB	5.0539192e-05
6,897	TableDC: Deep Clustering for Tabular Data	2025	SIGMOD	4.8878659e-05
7,642	Cross Modal Data Discovery over Structured and Unstructured Data Lakes	2023	VLDB	4.6856127e-05
8,121	LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes	2024	VLDB	4.5771143e-05
8,194	WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses	2023	CIDR	4.5574831e-05
8,500	A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science	2021	VLDB	4.4920292e-05
8,616	Nexus: Correlation Discovery over Collections of Spatio-Temporal Tabular Data	2024	SIGMOD	4.4795277e-05
8,725	OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance From Database Query Event Logs	2023	VLDB	4.453957e-05
8,912	R2D2: Reducing Redundancy and Duplication in Data Lakes	2023	SIGMOD	4.4229886e-05
9,381	Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations	2024	SIGMOD	4.343902e-05
9,960	QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes	2025	VLDB	4.2254157e-05
10,109	Retrieve-and-Verify: A Table Context Selection Framework for Accurate Column Annotations	2026	SIGMOD	4.1905499e-05
10,341	Revisiting Task-Oriented Dataset Search in the Era of Large Language Models: Challenges, Benchmark, and Solution	2026	VLDB	4.1905499e-05
10,519	Table Overlap Estimation through Graph Embeddings	2025	SIGMOD	4.1905499e-05
10,597	Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index	2025	VLDB	4.1905499e-05
10,603	Optimized Batch Prompting for Cost-effective LLMs	2025	VLDB	4.1905499e-05
10,653	OpenForge: Probabilistic Metadata Integration	2025	VLDB	4.1905499e-05
10,693	LakeVisage: Towards Scalable, Flexible and Interactive Visualization Recommendation for Data Discovery over Data Lakes	2025	VLDB	4.1905499e-05
10,760	OmniMatch: Joinability Discovery in Data Products	2025	VLDB	4.1905499e-05
10,840	Data Discovery in Data Lakes: Operations, Indexes, Systems	2025	VLDB	4.1905499e-05
10,954	Determining the Largest Overlap between Tables	2024	SIGMOD	4.1905499e-05
11,057	Enriching Relations with Additional Attributes for ER	2024	VLDB	4.1905499e-05
11,100	Navigating Data Repositories: Utilizing Line Charts to Discover Relevant Datasets	2024	VLDB	4.1905499e-05
11,532	Valentine in Action: Matching Tabular Data at Scale	2021	VLDB	4.1905499e-05

Outgoing Citations (Sorted by Pagerank)

Showing 13 of 13 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
108	WebTables: Exploring the Power of Tables on the Web	2008	VLDB	0.00048345996
365	Annotating and Searching Web Tables Using Entities, Types and Relationships	2010	VLDB	0.00025616694
420	InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables	2012	SIGMOD	0.00023700634
519	Data Integration for the Relational Web	2009	VLDB	0.00021148006
814	Finding Related Tables	2012	SIGMOD	0.00016298739
899	Statistical Schema Matching across Web Query Interfaces	2003	SIGMOD	0.00015476061
915	On Schema Matching with Opaque Column Names and Data Values	2003	SIGMOD	0.00015362622
1,005	Recovering Semantics of Tables on the Web	2011	VLDB	0.00014694038
1,370	Answering Table Queries on the Web using Column Keywords	2012	VLDB	0.00012339543
2,142	LSH Ensemble: Internet-Scale Domain Search	2016	VLDB	9.4461701e-05
3,798	Stitching Web Tables for Improving Matching Quality	2017	VLDB	6.753528e-05
3,993	Discovering Linkage Points over Web Data	2013	VLDB	6.5483476e-05
5,586	HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching	2009	VLDB	5.4194791e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
365	Annotating and Searching Web Tables Using Entities, Types and Relationships	2010	VLDB	0.00025616694
1,585	Answering Table Augmentation Queries from Unstructured Lists on the Web	2009	VLDB	0.00011245609
11,066	Searching Data Lakes for Nested and Joined Data	2024	VLDB	4.1905499e-05
3,326	DeepJoin: Joinable Table Discovery with Pre-trained Language Models	2023	VLDB	7.2148323e-05
1,005	Recovering Semantics of Tables on the Web	2011	VLDB	0.00014694038
8,121	LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes	2024	VLDB	4.5771143e-05
5,538	Data-Driven Domain Discovery for Structured Datasets	2020	VLDB	5.4520759e-05
2,842	Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning	2023	VLDB	8.0366354e-05
3,001	SANTOS: Relationship-based Semantic Table Union Search	2023	SIGMOD	7.739698e-05
814	Finding Related Tables	2012	SIGMOD	0.00016298739