Database Paper Browser

Back to papers

LSH Ensemble: Internet-Scale Domain Search

Summary: LSH Ensemble for web domain search uses Jaccard set containment with MinHash and domain partitioning to tame skew. Cost model shows an optimal partition; equi-depth partitions suit power-law, enabling fast and accurate queries on 262M domains. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID
11228
Venue
VLDB
Year
2016
Pagerank
9.4542625e-05
Overall Rank
2,141 | 85.11%
DOI
-

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 41 of 41 citing papers.

Rank Citing Paper Year Venue Pagerank
939 Data Lake Management: Challenges and Opportunities 2019 VLDB 0.00015187344
1,178 Table Union Search on Open Data 2018 VLDB 0.00013468118
1,187 JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes 2019 SIGMOD 0.00013443639
1,644 Finding Related Tables in Data Lakes for Interactive Data Science 2020 SIGMOD 0.00011041787
2,730 Open Data Integration 2018 VLDB 8.2126735e-05
2,836 Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning 2023 VLDB 8.0443826e-05
3,000 SANTOS: Relationship-based Semantic Table Union Search 2023 SIGMOD 7.7462128e-05
3,335 DeepJoin: Joinable Table Discovery with Pre-trained Language Models 2023 VLDB 7.2065006e-05
3,358 Organizing Data Lakes for Navigation 2020 SIGMOD 7.1784949e-05
3,400 ELPIS: Graph-Based Similarity Search for Scalable Data Science 2023 VLDB 7.1405533e-05
3,824 Correlation Sketches for Approximate Join-Correlation Queries 2021 SIGMOD 6.7260705e-05
3,963 Pytheas: Pattern-based Table Discovery in CSV Files 2020 VLDB 6.5840643e-05
4,278 Similarity Query Processing for High-Dimensional Data 2020 VLDB 6.2953764e-05
4,967 Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation 2022 SIGMOD 5.7956612e-05
5,024 Towards Distribution-aware Query Answering in Data Markets 2022 VLDB 5.7535043e-05
5,200 SetSketch: Filling the Gap between MinHash and HyperLogLog 2021 VLDB 5.6337581e-05
5,789 Interactive Navigation of Open Data Linkages 2017 VLDB 5.3269741e-05
5,794 Discovering Related Data At Scale 2021 VLDB 5.3245122e-05
5,976 Responsible Data Integration: Next-generation Challenges 2022 SIGMOD 5.245976e-05
6,092 Observatory: Characterizing Embeddings of Relational Tables 2024 VLDB 5.2138566e-05
6,270 MATE: Multi-Attribute Table Extraction 2022 VLDB 5.1337451e-05
6,438 RONIN: Data Lake Exploration 2021 VLDB 5.0620163e-05
6,798 Searching Web Data using MinHash LSH 2016 SIGMOD 4.9235251e-05
7,061 Serving Deep Learning Models with Deduplication from Relational Databases 2022 VLDB 4.8463881e-05
7,582 LakeCompass: An End-to-End System for Data Maintenance, Search and Analysis in Data Lakes 2024 VLDB 4.7046388e-05
7,643 Cross Modal Data Discovery over Structured and Unstructured Data Lakes 2023 VLDB 4.6901105e-05
8,116 LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes 2024 VLDB 4.581507e-05
8,618 Nexus: Correlation Discovery over Collections of Spatio-Temporal Tabular Data 2024 SIGMOD 4.4838259e-05
8,717 Scotch: Generating FPGA-Accelerators for Sketching at Line Rate 2021 VLDB 4.4614498e-05
8,729 OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance From Database Query Event Logs 2023 VLDB 4.4582221e-05
8,910 R2D2: Reducing Redundancy and Duplication in Data Lakes 2023 SIGMOD 4.427232e-05
9,211 Optimistic Data Parallelism for FPGA-Accelerated Sketching 2023 VLDB 4.3726511e-05
10,197 Qualitative Join Discovery in Data Lakes using Examples 2026 SIGMOD 4.1945683e-05
10,540 Discovering Approximate Inclusion Dependencies 2025 VLDB 4.1945683e-05
10,685 LakeVisage: Towards Scalable, Flexible and Interactive Visualization Recommendation for Data Discovery over Data Lakes 2025 VLDB 4.1945683e-05
10,836 Data Discovery in Data Lakes: Operations, Indexes, Systems 2025 VLDB 4.1945683e-05
10,951 Determining the Largest Overlap between Tables 2024 SIGMOD 4.1945683e-05
11,025 Sampling Methods for Inner Product Sketching 2024 VLDB 4.1945683e-05
11,063 Searching Data Lakes for Nested and Joined Data 2024 VLDB 4.1945683e-05
11,168 Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation 2023 PODS 4.1945683e-05
11,183 Matching Roles from Temporal Data 2023 SIGMOD 4.1945683e-05
Previous Page 1 / 1 Next

Outgoing Citations (Sorted by Pagerank)

Showing 9 of 9 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Previous Page 1 / 1 Next

Semantically Similar Papers