Fainder: A Fast and Accurate Index for Distribution-Aware Dataset Search
Summary: Fainder introduces a distribution-aware index for percentile predicates over heterogeneous histogram summaries, enabling dataset discovery based on distributional properties rather than keywords. It combines binary search with multi-step pruning on summary bounds to eliminate candidates, yielding order-of-magnitude speedups.
(summarized by gpt-5-mini on Feb 09 2026)
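To make the idea of a percentile predicate over a histogram summary concrete, here is a minimal sketch. It is not Fainder's index or algorithm. It only illustrates how a single histogram gives lower and upper bounds on the fraction of values below a threshold, and how those bounds can accept or reject a candidate without reading the raw data. The function names `fraction_below_bounds` and `classify` are hypothetical, and the sketch assumes half-open bins `[edges[i], edges[i+1])` and a non-empty histogram.

```python
from bisect import bisect_right
from itertools import accumulate


def fraction_below_bounds(edges, counts, threshold):
    """Return (lower, upper) bounds on the fraction of values < threshold.

    edges:  sorted bin edges, len(counts) + 1 entries
    counts: number of values in each half-open bin [edges[i], edges[i+1])
    Assumes sum(counts) > 0.
    """
    total = sum(counts)
    # cumulative[i] = number of values in bins 0 .. i-1
    cumulative = [0, *accumulate(counts)]
    # Binary search: k = number of edges that are <= threshold
    k = bisect_right(edges, threshold)
    if k == 0:
        return 0.0, 0.0  # threshold lies below every bin
    if k == len(edges):
        return 1.0, 1.0  # threshold lies at or above the last edge
    # threshold falls in bin k-1; all earlier bins are fully below it
    lower = cumulative[k - 1] / total
    # If threshold sits exactly on the bin's left edge, the bound is tight;
    # otherwise any share of bin k-1 could be below the threshold.
    upper = lower if threshold == edges[k - 1] else cumulative[k] / total
    return lower, upper


def classify(edges, counts, threshold, p):
    """Decide the predicate 'at least fraction p of values are < threshold'."""
    lower, upper = fraction_below_bounds(edges, counts, threshold)
    if lower >= p:
        return "match"      # true even in the worst case
    if upper < p:
        return "no match"   # false even in the best case
    return "uncertain"      # the histogram alone cannot decide


if __name__ == "__main__":
    edges, counts = [0, 10, 20, 30], [5, 3, 2]
    print(fraction_below_bounds(edges, counts, 15))
    print(classify(edges, counts, 15, 0.6))
```

Candidates classified as "uncertain" are the ones that need either a finer summary or an approximation policy. A real index would also have to avoid evaluating every histogram individually.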
- Paper ID: 13540
- Venue: VLDB
- Year: 2024
- Pagerank: 4.2511622e-05
- Overall Rank: 9,928 | 30.94%
- DOI: 10.14778/3681954.3681999
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 4 of 4 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 18 of 18 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 326 | Optimal Histograms with Quality Guarantees | 1998 | VLDB | 0.00027358981 |
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 1,612 | Detecting Data Errors: Where are we and what needs to be done? | 2016 | VLDB | 0.00011142794 |
| 1,644 | Finding Related Tables in Data Lakes for Interactive Data Science | 2020 | SIGMOD | 0.00011041787 |
| 1,751 | Auctus: A Dataset Search Engine for Data Discovery and Augmentation | 2021 | VLDB | 0.00010683295 |
| 3,358 | Organizing Data Lakes for Navigation | 2020 | SIGMOD | 7.1784949e-05 |
| 3,520 | GitTables: A Large-Scale Corpus of Relational Tables | 2023 | SIGMOD | 7.0131061e-05 |
| 5,024 | Towards Distribution-aware Query Answering in Data Markets | 2022 | VLDB | 5.7535043e-05 |
| 5,381 | Selective Data Acquisition in the Wild for Model Charging | 2022 | VLDB | 5.5399508e-05 |
| 5,794 | Discovering Related Data At Scale | 2021 | VLDB | 5.3245122e-05 |
| 6,270 | MATE: Multi-Attribute Table Extraction | 2022 | VLDB | 5.1337451e-05 |
| 6,438 | RONIN: Data Lake Exploration | 2021 | VLDB | 5.0620163e-05 |
| 6,467 | Tailoring Data Source Distributions for Fairness-aware Data Integration | 2021 | VLDB | 5.0528156e-05 |
| 6,944 | DataPrism: Exposing Disconnect between Data and Systems | 2022 | SIGMOD | 4.8912787e-05 |
| 7,303 | DICE: Data Discovery by Example | 2021 | VLDB | 4.7684686e-05 |
| 7,851 | Consistent Range Approximation for Fair Predictive Modeling | 2023 | VLDB | 4.6353072e-05 |
| 7,868 | Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach | 2023 | SIGMOD | 4.6319504e-05 |
| 8,618 | Nexus: Correlation Discovery over Collections of Spatio-Temporal Tabular Data | 2024 | SIGMOD | 4.4838259e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,808 | Top-k Query Evaluation with Probabilistic Guarantees | 2004 | VLDB | 0.00010486213 |
| 11,598 | IDAR: Fast Supergraph Search Using DAG Integration | 2020 | VLDB | 4.1945683e-05 |
| 7,915 | HINT: A Hierarchical Index for Intervals in Main Memory | 2022 | SIGMOD | 4.617775e-05 |
| 3,131 | FINEdex: A Fine-grained Learned Index Scheme for Scalable and Concurrent Memory Systems | 2022 | VLDB | 7.4985793e-05 |
| 9,176 | RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets | 2016 | SIGMOD | 4.383548e-05 |
| 10,960 | FairHash: A Fair and Memory/Time-efficient Hashmap | 2024 | SIGMOD | 4.1945683e-05 |
| 11,251 | Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests | 2023 | VLDB | 4.1945683e-05 |
| 11,379 | Fast Dataset Search with Earth Mover’s Distance | 2022 | VLDB | 4.1945683e-05 |
| 10,439 | Finding What You’re Looking For: A Distribution-Aware Dataset Search Engine in Action | 2025 | SIGMOD | 4.1945683e-05 |
| 10,341 | A Theoretical Framework for Distribution-Aware Dataset Search | 2025 | PODS | 4.1945683e-05 |