Database Paper Browser

Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services

Summary: Falcon scales hands-off crowdsourced entity matching (EM) beyond Corleone by applying RDBMS-style planning on Hadoop. It defines EM operators and turns EM workflows into executable plans that mix machine and crowd tasks, using crowd time to mask machine time so that EM can run at cloud scale on tables with millions of tuples. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5326
Venue: SIGMOD
Year: 2017
Pagerank: 9.3644117e-05
Overall Rank: 2,175 | 84.88%
DOI: 10.1145/3035918.3035960

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 30 of 30 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
754 | Distributed Representations of Tuples for Entity Resolution | 2018 | VLDB | 0.00017117211
1,914 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | 0.00010109102
2,767 | A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching | 2020 | SIGMOD | 8.1513883e-05
3,640 | Deep Learning for Blocking in Entity Matching: A Design Space Exploration | 2021 | VLDB | 6.8891671e-05
3,773 | Cleaning Crowdsourced Labels Using Oracles for Statistical Classification | 2019 | VLDB | 6.7758649e-05
4,212 | Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration | 2023 | SIGMOD | 6.3555142e-05
4,278 | Similarity Query Processing for High-Dimensional Data | 2020 | VLDB | 6.2953764e-05
4,402 | Smurf: Self-Service String Matching Using Random Forests | 2019 | VLDB | 6.2195162e-05
4,607 | Data Integration and Machine Learning: A Natural Synergy | 2018 | SIGMOD | 6.0538827e-05
4,989 | BEER: Blocking for Effective Entity Resolution | 2021 | SIGMOD | 5.7827362e-05
5,622 | Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach | 2020 | SIGMOD | 5.4060403e-05
5,978 | Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond | 2021 | SIGMOD | 5.2453012e-05
6,690 | Parallel Discrepancy Detection and Incremental Detection | 2021 | VLDB | 4.9621556e-05
6,747 | Entity Matching Meets Data Science: A Progress Report from the Magellan Project | 2019 | SIGMOD | 4.9408824e-05
6,868 | Cost-Effective Data Annotation using Game-Based Crowdsourcing | 2019 | VLDB | 4.9010083e-05
7,243 | Data Integration and Machine Learning: A Natural Synergy | 2018 | VLDB | 4.7913666e-05
7,668 | Human-in-the-loop Data Integration | 2017 | VLDB | 4.6834075e-05
8,005 | Online Topic-Aware Entity Resolution Over Incomplete Data Streams | 2021 | SIGMOD | 4.6081461e-05
8,099 | Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching | 2023 | VLDB | 4.5859317e-05
8,384 | Consistent and Flexible Selectivity Estimation for High-Dimensional Data | 2021 | SIGMOD | 4.5304673e-05
8,908 | Deep Active Alignment of Knowledge Graph Entities and Schemata | 2023 | SIGMOD | 4.427232e-05
9,434 | Rock: Cleaning Data by Embedding ML in Logic Rules | 2024 | SIGMOD | 4.3430376e-05
9,487 | Making It Tractable to Catch Duplicates and Conflicts in Graphs | 2023 | SIGMOD | 4.3341665e-05
9,832 | Balance-Aware Distributed String Similarity-Based Query Processing System | 2019 | VLDB | 4.2751057e-05
9,846 | HyperBlocker: Accelerating Rule-based Blocking in Entity Resolution using GPUs | 2025 | VLDB | 4.2721228e-05
10,022 | In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration | 2026 | SIGMOD | 4.1945683e-05
10,617 | Deduplicated Sampling On-Demand | 2025 | VLDB | 4.1945683e-05
11,223 | Splitting Tuples of Mismatched Entities | 2023 | SIGMOD | 4.1945683e-05
11,230 | VersaMatch: Ontology Matching with Weak Supervision | 2023 | VLDB | 4.1945683e-05
11,739 | CloudMatcher: A Hands-Off Cloud/Crowd Service for Entity Matching | 2018 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 30 of 30 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
94 | CrowdDB: Answering Queries with Crowdsourcing | 2011 | SIGMOD | 0.00051013264
119 | Answering Queries using Humans, Algorithms and Databases | 2011 | CIDR | 0.0004564788
199 | Declarative Data Cleaning: Language, Model, and Algorithms | 2001 | VLDB | 0.00035041015
249 | Crowdsourced Databases: Query Processing with People | 2011 | CIDR | 0.00030740523
250 | Efficient set joins on similarity predicates | 2004 | SIGMOD | 0.00030661988
263 | CrowdER: Crowdsourcing Entity Resolution | 2012 | VLDB | 0.00029862413
267 | Human-powered Sorts and Joins | 2012 | VLDB | 0.00029690405
447 | Efficient Parallel Set-Similarity Joins Using MapReduce | 2010 | SIGMOD | 0.00022900171
643 | Corleone: Hands-Off Crowdsourcing for Entity Matching | 2014 | SIGMOD | 0.00018754451
697 | Human-Assisted Graph Search: It’s Okay to Ask Questions | 2011 | VLDB | 0.00018043655
859 | So Who Won? Dynamic Max Discovery with the Crowd | 2012 | SIGMOD | 0.00015870894
866 | Leveraging Transitive Relations for Crowdsourced Joins | 2013 | SIGMOD | 0.00015801196
1,074 | Processing Theta-Joins using MapReduce | 2011 | SIGMOD | 0.00014260096
1,164 | CrowdScreen: Algorithms for Filtering Data with Humans | 2012 | SIGMOD | 0.00013564823
1,234 | Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints | 2008 | VLDB | 0.00013122499
1,242 | Question Selection for Crowd Entity Resolution | 2013 | VLDB | 0.00013096655
1,841 | Crowdsourcing Algorithms for Entity Resolution | 2014 | VLDB | 0.00010348858
2,334 | Counting with the Crowd | 2013 | VLDB | 9.0161817e-05
2,946 | BigDansing: A System for Big Data Cleansing | 2015 | SIGMOD | 7.8372441e-05
3,100 | Crowd Mining | 2013 | SIGMOD | 7.5634778e-05
3,118 | Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning | 2015 | VLDB | 7.5379338e-05
3,141 | ClusterJoin: A Similarity Joins Framework using Map-Reduce | 2014 | VLDB | 7.4829448e-05
3,528 | Distributed Data Deduplication | 2016 | VLDB | 7.0066139e-05
4,050 | An Efficient Partition Based Method for Exact Set Similarity Joins | 2016 | VLDB | 6.4953612e-05
4,216 | Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints | 2010 | VLDB | 6.3521675e-05
4,451 | CLAMShell: Speeding up Crowds for Low-latency Data Labeling | 2016 | VLDB | 6.1738675e-05
5,362 | Cost-Effective Crowdsourced Entity Resolution: A Partial-Order Approach | 2016 | SIGMOD | 5.5473503e-05
6,806 | Query Optimization over Crowdsourced Data | 2013 | VLDB | 4.9218336e-05
7,109 | Efficient Similarity Join and Search on Multi-Attribute Data | 2015 | SIGMOD | 4.8292998e-05
8,593 | Wisteria: Nurturing Scalable Data Cleaning Infrastructure | 2015 | VLDB | 4.4891474e-05

Semantically Similar Papers