Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching
Summary: Sparkly is a distributed, shared-nothing top-k TF/IDF blocker built on Lucene and Spark, with automatic attribute and tokenizer selection for entity matching. It outperforms eight state-of-the-art blockers on recall, output size, and runtime, and the paper advocates TF/IDF and top-k blocking as strong, scalable baselines. (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Derek Paulsen
2. Yash Govind
3. AnHai Doan
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 9,855 | Progressive Entity Matching: A Design Space Exploration | 2025 | SIGMOD | 4.269353e-05 |
| 10,040 | 3dSAGER: Geospatial Entity Resolution over 3D Objects | 2026 | SIGMOD | 4.1945683e-05 |
| 10,617 | Deduplicated Sampling On-Demand | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 12 of 12 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 4,650 | LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data | 2016 | VLDB | 6.0234336e-05 |
| 9,846 | HyperBlocker: Accelerating Rule-based Blocking in Entity Resolution using GPUs | 2025 | VLDB | 4.2721228e-05 |
| 11,373 | Generalized Supervised Meta-blocking | 2022 | VLDB | 4.1945683e-05 |
| 1,201 | SPARK: Top-k Keyword Query in Relational Databases | 2007 | SIGMOD | 1.334371e-04 |
| 5,228 | Schema-agnostic vs Schema-based Configurations for Blocking Methods on Homogeneous Data | 2016 | VLDB | 5.6158315e-05 |
| 1,410 | Entity Resolution with Iterative Blocking | 2009 | SIGMOD | 1.2127555e-04 |
| 4,974 | Supervised Meta-blocking | 2014 | VLDB | 5.7903293e-05 |
| 3,640 | Deep Learning for Blocking in Entity Matching: A Design Space Exploration | 2021 | VLDB | 6.8891671e-05 |
| 3,977 | BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution | 2016 | VLDB | 6.5736268e-05 |
| 2,514 | Comparative Analysis of Approximate Blocking Techniques for Entity Resolution | 2016 | VLDB | 8.6139012e-05 |