Sampling Dirty Data for Matching Attributes
Summary: The paper samples dirty relational data to reveal attributes whose sets of string values overlap, making them candidates for joins. It proposes measures that blend set overlap with string-instance similarity, with distributed sampling and comparisons, and adds a two-stage filter that balances accuracy and speed. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Henning Köhler
2. Xiaofang Zhou
3. Shazia Sadiq
4. Yanfeng Shu
5. Kerry Taylor
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 4,297 | Robust Set Reconciliation | 2014 | SIGMOD | 6.2885419e-05 |
| 7,759 | Dscaler: Synthetically Scaling A Given Relational Database | 2016 | VLDB | 4.6593145e-05 |
| 10,817 | Mining Meaningful Keys and Foreign Keys with High Precision and Recall | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 14 of 14 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,490 | Leveraging Set Relations in Exact Set Similarity Join | 2017 | VLDB | 7.0465856e-05 |
| 4,901 | Probabilistic String Similarity Joins | 2010 | SIGMOD | 5.8411648e-05 |
| 46 | Simple Random Sampling from Relational Databases | 1986 | VLDB | 0.00070894702 |
| 11,979 | Similarity Joins for Uncertain Strings | 2014 | SIGMOD | 4.1945683e-05 |
| 3,529 | Merging the Results of Approximate Match Operations | 2004 | VLDB | 7.0059524e-05 |
| 155 | Robust and Efficient Fuzzy Match for Online Data Cleaning | 2003 | SIGMOD | 0.00040637896 |
| 9,563 | Towards a Unified Framework for String Similarity Joins | 2019 | VLDB | 4.3254416e-05 |
| 2,184 | A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data | 2014 | SIGMOD | 9.3429789e-05 |
| 2,740 | String Similarity Joins: An Experimental Evaluation | 2014 | VLDB | 8.1980628e-05 |
| 4,026 | Flexible String Matching Against Large Databases in Practice | 2004 | VLDB | 6.5169976e-05 |