Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation
Summary: Proposes scalable near-duplicate sequence search to measure LLM memorization in trillion-token corpora. The approach groups min-hash values for all sequences with at least t tokens in linear time, uses inverted indexes and prefix filtering, and proves a bound 2^{(n+1)/(t+1)}−1, with real-world validation.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 6682
- Venue: SIGMOD
- Year: 2023
- Pagerank: 4.2667743e-05
- Overall Rank: 9,876 | 31.30%
- DOI: 10.1145/3589324
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 13 of 13 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 616 | Copy Detection Mechanisms for Digital Documents | 1995 | SIGMOD | 0.00019108201 |
| 705 | Winnowing: Local Algorithms for Document Fingerprinting | 2003 | SIGMOD | 0.00017864657 |
| 1,396 | Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search | 2012 | SIGMOD | 0.00012204748 |
| 1,674 | Adaptive Parallel Aggregation Algorithms | 1995 | SIGMOD | 0.0001094787 |
| 2,592 | Pass-Join: A Partition-based Method for Similarity Joins | 2012 | VLDB | 8.4795761e-05 |
| 4,050 | An Efficient Partition Based Method for Exact Set Similarity Joins | 2016 | VLDB | 6.4953612e-05 |
| 4,250 | Local Similarity Search for Unstructured Text | 2016 | SIGMOD | 6.3241139e-05 |
| 4,353 | Overlap Set Similarity Joins with Theoretical Guarantees | 2018 | SIGMOD | 6.263585e-05 |
| 5,073 | Faerie: Efficient Filtering Algorithms for Approximate Dictionary-based Entity Extraction | 2011 | SIGMOD | 5.7177424e-05 |
| 6,726 | A Pivotal Prefix Based Filtering Algorithm for String Similarity Search | 2014 | SIGMOD | 4.9484027e-05 |
| 7,635 | Allign: Aligning All-Pair Near-Duplicate Passages in Long Texts | 2021 | SIGMOD | 4.6908858e-05 |
| 8,291 | TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection | 2022 | SIGMOD | 4.5435639e-05 |
| 9,567 | META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion | 2016 | VLDB | 4.3254416e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 9,805 | MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training | 2025 | SIGMOD | 4.2805224e-05 |
| 7,635 | Allign: Aligning All-Pair Near-Duplicate Passages in Long Texts | 2021 | SIGMOD | 4.6908858e-05 |
| 10,064 | Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees | 2026 | SIGMOD | 4.1945683e-05 |
| 10,452 | ScaleLLM: A Technique for Scalable LLM-augmented Data Systems | 2025 | SIGMOD | 4.1945683e-05 |
| 7,700 | Near-Duplicate Text Alignment with One Permutation Hashing | 2024 | SIGMOD | 4.6744372e-05 |
| 13,138 | Database Perspective on LLM Inference Systems | 2025 | VLDB | - |
| 8,291 | TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection | 2022 | SIGMOD | 4.5435639e-05 |
| 10,022 | In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration | 2026 | SIGMOD | 4.1945683e-05 |
| 11,058 | LLM-PBE: Assessing Data Privacy in Large Language Models | 2024 | VLDB | 4.1945683e-05 |
| 10,499 | Privacy and Accuracy-Aware AI/ML Model Deduplication | 2025 | SIGMOD | 4.1945683e-05 |