Database Paper Browser

Finding replicated web collections

Summary: The paper identifies replicated documents and replicated hyperlinked collections on the web in order to improve crawlers, archivers, and ranking. Its detection techniques scale to tens of millions of pages and hundreds of gigabytes of data; two real-life case studies on a 25M-page dataset (~150 GB) show gains for a crawler and a search engine. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 3198
Venue: SIGMOD
Year: 2000
Pagerank: 6.8477289e-05
Overall Rank: 3,683 | 74.38%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 4 of 4 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
1,414 | Graph Pattern Matching: From Intractable to Polynomial Time | 2010 | VLDB | 0.00012118275
2,938 | Graph Homomorphism Revisited for Graph Matching | 2010 | VLDB | 7.8524059e-05
8,313 | Resource-Adaptive Real-Time New Event Detection | 2007 | SIGMOD | 4.5435639e-05
9,502 | Streaming Similarity Self-Join | 2016 | VLDB | 4.3341665e-05

Outgoing Citations (Sorted by Pagerank)

Showing 2 of 2 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
597 | Computing Iceberg Queries Efficiently | 1998 | VLDB | 0.00019475592
616 | Copy Detection Mechanisms for Digital Documents | 1995 | SIGMOD | 0.00019108201

Semantically Similar Papers