Mining a Search Engine’s Corpus: Efficient Yet Unbiased Sampling and Aggregate Estimation
Summary: Addresses unbiased sampling and online aggregate estimation over a search-engine corpus that is accessible only through keyword queries. Proposes provably unbiased, low-variance methods with an order-of-magnitude lower query cost, validated by theory and experiments. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Mingyang Zhang
2. Nan Zhang
3. Gautam Das
Incoming Citations (Sorted by Pagerank)
Showing 5 of 5 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,678 | Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment | 2019 | SIGMOD | 4.4702119e-05 |
| 11,722 | Deeper: A Data Enrichment System Powered by Deep Web | 2018 | SIGMOD | 4.1945683e-05 |
| 11,977 | Aggregate Estimation Over a Microblog Platform | 2014 | SIGMOD | 4.1945683e-05 |
| 12,112 | Aggregate Suppression for Enterprise Search Engines | 2012 | SIGMOD | 4.1945683e-05 |
| 13,381 | Aggregate Estimations over Location Based Services | 2015 | VLDB | - |
Outgoing Citations (Sorted by Pagerank)
Showing 5 of 5 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 449 | Approximate Query Processing: Taming the TeraBytes! A Tutorial | 2001 | VLDB | 0.00022846068 |
| 1,492 | Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection | 2002 | VLDB | 0.00011694396 |
| 2,813 | Mining Search Engine Query Logs via Suggestion Sampling | 2008 | VLDB | 8.0773142e-05 |
| 5,140 | A Random Walk Approach to Sampling Hidden Databases | 2007 | SIGMOD | 5.668209e-05 |
| 8,684 | Unbiased Estimation of Size and Other Aggregates Over Hidden Web Databases | 2010 | SIGMOD | 4.4677591e-05 |