Dedoop: Efficient Deduplication with Hadoop
Summary: Dedoop offers browser-based specification of complex entity resolution (ER) workflows, including blocking, similarity functions, and classifiers trained by machine learning, for MapReduce-based deduplication on Hadoop. It automatically translates these workflows into MapReduce jobs, visualizes results and workload, and uses blocking-aware load balancing to limit the number of comparisons and spread work evenly across the cluster. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Lars Kolb
2. Andreas Thor
3. Erhard Rahm
Incoming Citations (Sorted by Pagerank)
Showing 14 of 14 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 1 of 1 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 319 | Evaluation of entity resolution approaches on real-world match problems | 2010 | VLDB | 0.00027781866 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 9,375 | Efficient Big Data Processing in Hadoop MapReduce | 2012 | VLDB | 4.347384e-05 |
| 9,115 | MapReduce Algorithms for Big Data Analysis | 2012 | VLDB | 4.3932167e-05 |
| 5,838 | HadoopDB in Action: Building Real World Applications | 2010 | SIGMOD | 5.3059032e-05 |
| 4,147 | Exploiting MapReduce-based Similarity Joins | 2012 | SIGMOD | 6.4096022e-05 |
| 5,236 | Online Deduplication for Databases | 2017 | SIGMOD | 5.611324e-05 |
| 6,042 | MDedup: Duplicate Detection with Matching Dependencies | 2020 | VLDB | 5.2405269e-05 |
| 794 | Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing) | 2010 | VLDB | 0.00016605103 |
| 3,528 | Distributed Data Deduplication | 2016 | VLDB | 7.0066139e-05 |
| 13,508 | MapDupReducer: Detecting Near Duplicates over Massive Datasets | 2010 | SIGMOD | - |
| 9,266 | Redoop Infrastructure for Recurring Big Data Queries | 2014 | VLDB | 4.3667196e-05 |