Allign: Aligning All-Pair Near-Duplicate Passages in Long Texts
Summary: Allign uses a min-hash-based method to align all pairs of near-duplicate passages between two texts. Compact windows let it avoid enumerating the O(n^2 m^2) candidate passage pairs. It matches windows that share a min-hash, reports both the longest and the sentence-level near-duplicates, and outperforms prior alignment methods on real data.
(summarized by gpt-5-nano on Feb 09 2026)
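The summary relies on min-hash as the similarity primitive. As a minimal sketch of that primitive only (not the paper's compact-window algorithm), the Python below estimates the Jaccard similarity of two token sequences from their min-hash signatures. The function names, the seeded BLAKE2b hash family, and the signature length `k` are illustrative assumptions and do not come from the paper.

```python
import hashlib


def token_hash(token: str, seed: int) -> int:
    """Hash a token under one of k seeded hash functions (assumed hash family)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, k=64):
    """Return k min-hash values, one per seeded hash function."""
    distinct = set(tokens)
    if not distinct:
        raise ValueError("cannot sketch an empty passage")
    return [min(token_hash(t, seed) for t in distinct) for seed in range(k)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity of the two token sets, for comparison."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)


passage_a = "the quick brown fox jumps over the lazy dog".split()
passage_b = "the quick brown fox leaps over the lazy dog".split()

sig_a = minhash_signature(passage_a)
sig_b = minhash_signature(passage_b)
print("estimated:", estimated_jaccard(sig_a, sig_b))
print("exact:    ", exact_jaccard(passage_a, passage_b))
```

Two passages agree at a signature position exactly when the same token attains the minimum hash in both, which happens with probability equal to their Jaccard similarity. This is why matching on shared min-hash values can serve as a cheap filter for near-duplicate candidates.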
- Paper ID: 6237
- Venue: SIGMOD
- Year: 2021
- Pagerank: 4.6908858e-05
- Overall Rank: 7,635 | 46.89%
- DOI: 10.1145/3448016.3457548
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 6 of 6 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 14 of 14 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 34 | Similarity Search in High Dimensions via Hashing | 1999 | VLDB | 0.00076637636 |
| 616 | Copy Detection Mechanisms for Digital Documents | 1995 | SIGMOD | 0.00019108201 |
| 705 | Winnowing: Local Algorithms for Document Fingerprinting | 2003 | SIGMOD | 0.00017864657 |
| 1,305 | Bayesian Locality Sensitive Hashing for Fast Similarity Search | 2012 | VLDB | 0.00012687101 |
| 1,396 | Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search | 2012 | SIGMOD | 0.00012204748 |
| 2,592 | Pass-Join: A Partition-based Method for Similarity Joins | 2012 | VLDB | 8.4795761e-05 |
| 3,578 | Efficient Approximate Entity Extraction with Edit Distance Constraints | 2009 | SIGMOD | 6.9503858e-05 |
| 4,050 | An Efficient Partition Based Method for Exact Set Similarity Joins | 2016 | VLDB | 6.4953612e-05 |
| 4,250 | Local Similarity Search for Unstructured Text | 2016 | SIGMOD | 6.3241139e-05 |
| 4,353 | Overlap Set Similarity Joins with Theoretical Guarantees | 2018 | SIGMOD | 6.263585e-05 |
| 4,808 | On the Complexity of Inner Product Similarity Join | 2016 | PODS | 5.908896e-05 |
| 5,073 | Faerie: Efficient Filtering Algorithms for Approximate Dictionary-based Entity Extraction | 2011 | SIGMOD | 5.7177424e-05 |
| 6,074 | Pigeonring: A Principle for Faster Thresholded Similarity Search | 2019 | VLDB | 5.2242306e-05 |
| 6,726 | A Pivotal Prefix Based Filtering Algorithm for String Similarity Search | 2014 | SIGMOD | 4.9484027e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,035 | A New Approach for Processing Ranked Subsequence Matching Based on Ranked Union | 2011 | SIGMOD | 4.6009403e-05 |
| 1,234 | Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints | 2008 | VLDB | 0.00013122499 |
| 9,876 | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation | 2023 | SIGMOD | 4.2667743e-05 |
| 9,933 | Efficient and Effective KNN Sequence Search with Approximate n-grams | 2014 | VLDB | 4.2500258e-05 |
| 2,592 | Pass-Join: A Partition-based Method for Similarity Joins | 2012 | VLDB | 8.4795761e-05 |
| 7,708 | Efficient Top-k Algorithms for Approximate Substring Matching | 2013 | SIGMOD | 4.6721808e-05 |
| 4,250 | Local Similarity Search for Unstructured Text | 2016 | SIGMOD | 6.3241139e-05 |
| 10,266 | Near-Duplicate Text Alignment under Weighted Jaccard Similarity | 2026 | VLDB | 4.1945683e-05 |
| 7,700 | Near-Duplicate Text Alignment with One Permutation Hashing | 2024 | SIGMOD | 4.6744372e-05 |
| 8,291 | TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection | 2022 | SIGMOD | 4.5435639e-05 |