Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins
Summary: Ember enables no-code context enrichment via a general keyless-join operator. It learns task-specific embeddings with Transformer-based representations, builds a similarity index over these embeddings, and delivers up to 39% recall gains across five domains with minimal configuration.
(summarized by gpt-5-nano on Feb 09 2026)
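The keyless-join idea in the summary can be illustrated with a minimal sketch. This is not Ember's implementation: the character-trigram `embed` function is a toy stand-in for the learned Transformer-based embeddings, the brute-force scan stands in for a real similarity index, and the names `embed`, `cosine`, and `keyless_join` and the sample records are all hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned embedding: character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyless_join(left, right, k=1):
    """For each left record, return the k most similar right records.

    No shared key column is needed; records are matched by similarity only.
    """
    # Brute-force "index": embed every right-hand record once up front.
    index = [(row, embed(row)) for row in right]
    result = {}
    for query in left:
        q = embed(query)
        ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
        result[query] = [row for row, _ in ranked[:k]]
    return result


# Hypothetical records with no common key between the two sides.
left = ["apple iphone 13 pro", "sony wh-1000xm4 headphones"]
right = [
    "Sony WH-1000XM4 wireless noise-cancelling headphones",
    "Apple iPhone 13 Pro, 128 GB, graphite",
    "Dell XPS 13 laptop",
]
matches = keyless_join(left, right, k=1)
print(matches)
```

A practical system would presumably swap the trigram counts for trained dense embeddings and the linear scan for a nearest-neighbor index; the join logic (embed both sides, retrieve the top-k most similar records) stays the same shape.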
- Paper ID: 12942
- Venue: VLDB
- Year: 2022
- Pagerank: 6.6114622e-05
- Overall Rank: 3,942 | 72.58%
- DOI: 10.14778/3494124.3494149
Incoming Citations (Sorted by Pagerank)
Showing 9 of 9 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,643 | CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions using GPT-3 Codex | 2022 | VLDB | 0.0001104256 |
| 3,335 | DeepJoin: Joinable Table Discovery with Pre-trained Language Models | 2023 | VLDB | 7.2065006e-05 |
| 4,934 | From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management | 2022 | VLDB | 5.8198826e-05 |
| 6,737 | Demonstrating GPT-DB: Generating Query-Specific and Customizable Code for SQL Processing with GPT-4 | 2023 | VLDB | 4.9457488e-05 |
| 7,643 | Cross Modal Data Discovery over Structured and Unstructured Data Lakes | 2023 | VLDB | 4.6901105e-05 |
| 8,186 | E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model | 2025 | VLDB | 4.5651684e-05 |
| 9,961 | QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes | 2025 | VLDB | 4.2294678e-05 |
| 10,510 | Table Overlap Estimation through Graph Embeddings | 2025 | SIGMOD | 4.1945683e-05 |
| 10,836 | Data Discovery in Data Lakes: Operations, Indexes, Systems | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 21 of 21 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 206 | Constructing an Interactive Natural Language Interface for Relational Databases | 2015 | VLDB | 0.00034667032 |
| 221 | Deep Entity Matching with Pre-Trained Language Models | 2021 | VLDB | 0.00033121824 |
| 254 | Snorkel: Rapid Training Data Creation with Weak Supervision | 2018 | VLDB | 0.00030540555 |
| 300 | Deep Learning for Entity Matching: A Design Space Exploration | 2018 | SIGMOD | 0.00028441466 |
| 518 | Data Integration for the Relational Web | 2009 | VLDB | 0.00021158934 |
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 712 | Magellan: Toward Building Entity Matching Management Systems | 2016 | VLDB | 0.00017732426 |
| 754 | Distributed Representations of Tuples for Entity Resolution | 2018 | VLDB | 0.00017117211 |
| 903 | To Join or Not to Join? Thinking Twice about Joins before Feature Selection | 2016 | SIGMOD | 0.0001547016 |
| 1,198 | Crossing the Structure Chasm | 2003 | CIDR | 0.00013366708 |
| 1,281 | DataHub: Collaborative Data Science & Dataset Version Management at Scale | 2015 | CIDR | 0.00012854744 |
| 1,463 | ARDA: Automatic Relational Data Augmentation for Machine Learning | 2020 | VLDB | 0.00011869295 |
| 1,751 | Auctus: A Dataset Search Engine for Data Discovery and Augmentation | 2021 | VLDB | 0.00010683295 |
| 3,640 | Deep Learning for Blocking in Entity Matching: A Design Space Exploration | 2021 | VLDB | 6.8891671e-05 |
| 4,129 | Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers? | 2018 | VLDB | 6.428887e-05 |
| 4,196 | Overton: A Data System for Monitoring and Improving Machine-Learned Products | 2020 | CIDR | 6.3686231e-05 |
| 5,058 | A Demo of the Data Civilizer System | 2017 | SIGMOD | 5.7280139e-05 |
| 5,434 | Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples | 2021 | SIGMOD | 5.5045402e-05 |
| 8,137 | Customizable and Scalable Fuzzy Join for Big Data | 2019 | VLDB | 4.5774794e-05 |
| 9,438 | Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation | 2021 | CIDR | 4.3425082e-05 |
| 11,629 | Leveraging Organizational Resources to Adapt Models to New Data Modalities | 2020 | VLDB | 4.1945683e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 6,796 | InferDB: In-Database Machine Learning Inference Using Indexes | 2024 | VLDB | 4.9241624e-05 |
| 3,640 | Deep Learning for Blocking in Entity Matching: A Design Space Exploration | 2021 | VLDB | 6.8891671e-05 |
| 10,325 | KEN: An Execution Engine for Unstructured Database Systems | 2026 | VLDB | 4.1945683e-05 |
| 1,914 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | 0.00010109102 |
| 6,800 | DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models | 2024 | SIGMOD | 4.9231471e-05 |
| 10,090 | Integrating Vector Databases across Embedding Models | 2026 | SIGMOD | 4.1945683e-05 |
| 8,899 | Fast Approximate Similarity Join in Vector Databases | 2025 | SIGMOD | 4.427232e-05 |
| 10,022 | In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration | 2026 | SIGMOD | 4.1945683e-05 |
| 10,973 | Unstructured Data Fusion for Schema and Data Extraction | 2024 | SIGMOD | 4.1945683e-05 |
| 3,335 | DeepJoin: Joinable Table Discovery with Pre-trained Language Models | 2023 | VLDB | 7.2065006e-05 |