Finding replicated web collections
Summary: The paper identifies replicated documents and hyperlinked collections on the web in order to improve crawlers, archivers, and ranking. Its detection approach scales to tens of millions of pages and hundreds of gigabytes; two real-life case studies show gains for a crawler and a search engine on a dataset of 25 million pages (about 150 GB). (summarized by gpt-5-nano on Feb 09 2026)
Incoming Non-self Citations Over Time
Authors
Incoming Citations (Sorted by Pagerank)
Showing 4 of 4 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,414 | Graph Pattern Matching: From Intractable to Polynomial Time | 2010 | VLDB | 0.00012118275 |
| 2,938 | Graph Homomorphism Revisited for Graph Matching | 2010 | VLDB | 7.8524059e-05 |
| 8,313 | Resource-Adaptive Real-Time New Event Detection | 2007 | SIGMOD | 4.5435639e-05 |
| 9,502 | Streaming Similarity Self-Join | 2016 | VLDB | 4.3341665e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 2 of 2 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 597 | Computing Iceberg Queries Efficiently | 1998 | VLDB | 0.00019475592 |
| 616 | Copy Detection Mechanisms for Digital Documents | 1995 | SIGMOD | 0.00019108201 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 2,539 | Computing PageRank in a Distributed Internet Search System | 2004 | VLDB | 8.5820857e-05 |
| 9,548 | Optimal Algorithms for Crawling a Hidden Database in the Web | 2012 | VLDB | 4.3258142e-05 |
| 12,178 | Large-Scale Copy Detection | 2011 | SIGMOD | 4.1945683e-05 |
| 13,808 | A Method of Re-ranking Web Search Results Using their Hidden Hyperlink Structure | 2002 | VLDB | - |
| 771 | Distributed Hypertext Resource Discovery Through Examples | 1999 | VLDB | 0.00016887664 |
| 3,950 | Probe, Count, and Classify: Categorizing Hidden-Web Databases | 2001 | SIGMOD | 6.5953844e-05 |
| 1,492 | Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection | 2002 | VLDB | 0.00011694396 |
| 7,768 | Accurate and Efficient Crawling for Relevant Websites | 2004 | VLDB | 4.6563056e-05 |
| 6,928 | The Evolution of the Web and Implications for an Incremental Crawler | 2000 | VLDB | 4.8925595e-05 |
| 12,669 | Self-similarity in the web | 2001 | VLDB | 4.1945683e-05 |