Smurf: Self-Service String Matching Using Random Forests
Summary: Smurf enables self-service string matching (SM) between two sets of strings, using random forests (RF) with active learning. It reduces labeling effort by 43–76% while maintaining F1. Its RDBMS-style plan optimization reuses computations across the trees of the forest, advancing both self-service SM and scalable RF execution over structured data.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 11964
- Venue: VLDB
- Year: 2019
- Pagerank: 6.2195162e-05
- Overall Rank: 4,402 | 69.38%
- DOI: 10.14778/3291264.3291272
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 8 of 8 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,914 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | 0.00010109102 |
| 4,212 | Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration | 2023 | SIGMOD | 6.3555142e-05 |
| 6,553 | How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses | 2024 | VLDB | 5.0157344e-05 |
| 6,747 | Entity Matching Meets Data Science: A Progress Report from the Magellan Project | 2019 | SIGMOD | 4.9408824e-05 |
| 9,355 | Discovering Top-k Rules using Subjective and Objective Criteria | 2023 | SIGMOD | 4.3514328e-05 |
| 10,489 | Incremental Rule Discovery in Response to Parameter Updates | 2025 | SIGMOD | 4.1945683e-05 |
| 11,087 | Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity Matching | 2024 | VLDB | 4.1945683e-05 |
| 11,483 | Shahin: Faster Algorithms for Generating Explanations for Multiple Predictions | 2021 | SIGMOD | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 25 of 25 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 125 | Approximate String Joins in a Database (Almost) for Free | 2001 | VLDB | 0.00044847972 |
| 179 | Efficient and Extensible Algorithms for Multi Query Optimization | 2000 | SIGMOD | 0.00037672155 |
| 250 | Efficient set joins on similarity predicates | 2004 | SIGMOD | 0.00030661988 |
| 266 | Efficient Exact Set-Similarity Joins | 2006 | VLDB | 0.00029718727 |
| 447 | Efficient Parallel Set-Similarity Joins Using MapReduce | 2010 | SIGMOD | 0.00022900171 |
| 643 | Corleone: Hands-Off Crowdsourcing for Entity Matching | 2014 | SIGMOD | 0.00018754451 |
| 712 | Magellan: Toward Building Entity Matching Management Systems | 2016 | VLDB | 0.00017732426 |
| 834 | Learning Linear Regression Models over Factorized Joins | 2016 | SIGMOD | 0.00016135159 |
| 1,043 | Adaptive Ordering of Pipelined Stream Filters | 2004 | SIGMOD | 0.00014476247 |
| 1,107 | SPRINT: A Scalable Parallel Classifier for Data Mining | 1996 | VLDB | 0.00013985717 |
| 1,167 | Learning Generalized Linear Models Over Normalized Data | 2015 | SIGMOD | 0.00013547713 |
| 1,476 | Efficient Exploitation of Similar Subexpressions for Query Processing | 2007 | SIGMOD | 0.00011779092 |
| 1,715 | V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors | 2012 | VLDB | 0.00010803271 |
| 2,175 | Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services | 2017 | SIGMOD | 9.3644117e-05 |
| 2,376 | Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance | 2010 | SIGMOD | 8.9424361e-05 |
| 2,630 | PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce | 2009 | VLDB | 8.4128091e-05 |
| 2,740 | String Similarity Joins: An Experimental Evaluation | 2014 | VLDB | 8.1980628e-05 |
| 3,141 | ClusterJoin: A Similarity Joins Framework using Map-Reduce | 2014 | VLDB | 7.4829448e-05 |
| 3,459 | An Empirical Evaluation of Set Similarity Join Techniques | 2016 | VLDB | 7.072508e-05 |
| 4,353 | Overlap Set Similarity Joins with Theoretical Guarantees | 2018 | SIGMOD | 6.263585e-05 |
| 4,684 | Approximate String Joins with Abbreviations | 2018 | VLDB | 6.0006406e-05 |
| 6,605 | Dima: A Distributed In-Memory Similarity-Based Query Processing System | 2017 | VLDB | 4.9965703e-05 |
| 7,109 | Efficient Similarity Join and Search on Multi-Attribute Data | 2015 | SIGMOD | 4.8292998e-05 |
| 9,439 | On-the-Fly Token Similarity Joins in Relational Databases | 2014 | SIGMOD | 4.3423824e-05 |
| 11,739 | CloudMatcher: A Hands-Off Cloud/Crowd Service for Entity Matching | 2018 | VLDB | 4.1945683e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,911 | PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching | 2023 | VLDB | 4.427232e-05 |
| 3,640 | Deep Learning for Blocking in Entity Matching: A Design Space Exploration | 2021 | VLDB | 6.8891671e-05 |
| 4,026 | Flexible String Matching Against Large Databases in Practice | 2004 | VLDB | 6.5169976e-05 |
| 300 | Deep Learning for Entity Matching: A Design Space Exploration | 2018 | SIGMOD | 0.00028441466 |
| 11,087 | Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity Matching | 2024 | VLDB | 4.1945683e-05 |
| 11,251 | Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests | 2023 | VLDB | 4.1945683e-05 |
| 2,175 | Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services | 2017 | SIGMOD | 9.3644117e-05 |
| 2,767 | A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching | 2020 | SIGMOD | 8.1513883e-05 |
| 5,869 | Demonstration of Panda: A Weakly Supervised Entity Matching System | 2021 | VLDB | 5.2959029e-05 |
| 9,409 | Ground Truth Inference for Weakly Supervised Entity Matching | 2023 | SIGMOD | 4.3441378e-05 |