Database Paper Browser

Sampling-Based Estimation of the Number of Distinct Values of an Attribute

Summary: Proposes several sampling-based estimators for the number of distinct values (NDV) of an attribute and compares them empirically on highly skewed real-world data. Introduces a hybrid estimator that combines a smoothed jackknife estimator with Shlosser's estimator, aiming for high precision at a given sampling fraction while remaining scalable. (summarized by gpt-5-nano on Feb 09 2026)
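
As a rough illustration of what a sampling-based NDV estimator computes, the sketch below implements Shlosser's estimator in its commonly stated form: D = d + f1 * sum_i((1 - q)^i * f_i) / sum_i(i * q * (1 - q)^(i - 1) * f_i), where q is the sampling fraction, d is the number of distinct values in the sample, and f_i is the number of values that appear exactly i times in the sample. The function name `shlosser_ndv` and its parameters are illustrative assumptions, not something taken from this record; the smoothed jackknife and the hybrid estimator from the paper are not reproduced here.

```python
from collections import Counter


def shlosser_ndv(sample, population_size):
    """Estimate the number of distinct values in a column from a uniform sample.

    Hypothetical helper: a sketch of Shlosser's estimator, assuming `sample`
    was drawn uniformly at random from a column of `population_size` rows.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    if population_size < n:
        raise ValueError("population_size must be at least the sample size")

    q = n / population_size                         # sampling fraction
    value_counts = Counter(sample)                  # value -> occurrences in sample
    d = len(value_counts)                           # distinct values seen in sample
    freq_of_freq = Counter(value_counts.values())   # i -> f_i
    f1 = freq_of_freq.get(1, 0)                     # values seen exactly once

    # With no singletons or a full scan, the correction term is zero.
    if f1 == 0 or q >= 1.0:
        return float(d)

    numerator = sum((1 - q) ** i * f for i, f in freq_of_freq.items())
    denominator = sum(i * q * (1 - q) ** (i - 1) * f for i, f in freq_of_freq.items())
    return d + f1 * numerator / denominator


if __name__ == "__main__":
    # 4 rows sampled from an 8-row column: "a" twice, "b" and "c" once each.
    print(shlosser_ndv(["a", "a", "b", "c"], population_size=8))
```

The estimate scales the count of singleton values by a ratio that depends on the sampling fraction, so it grows when many values are seen only once in a small sample.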

Paper ID: 8278
Venue: VLDB
Year: 1995
Pagerank: 0.00064501896
Overall Rank: 59 | 99.60%
DOI: -

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 50 of 58 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
11 | Implementing Data Cubes Efficiently | 1996 | SIGMOD | 0.0011708144
43 | Models and Issues in Data Stream Systems | 2002 | PODS | 0.00072723062
64 | Improved Histograms for Selectivity Estimation of Range Predicates | 1996 | SIGMOD | 0.00063612837
184 | New Sampling-Based Summary Statistics for Improving Approximate Query Answers | 1998 | SIGMOD | 0.00036625711
211 | Join Synopses for Approximate Query Answering | 1999 | SIGMOD | 0.00033981214
247 | On the Computation of Multidimensional Aggregates | 1996 | VLDB | 0.00030927763
308 | Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports | 2001 | VLDB | 0.00028142852
310 | The Vertica Analytic Database: C-Store 7 Years Later | 2012 | VLDB | 0.00028132402
378 | Towards Estimation Error Guarantees for Distinct Values | 2000 | PODS | 0.0002497492
449 | Approximate Query Processing: Taming the TeraBytes! A Tutorial | 2001 | VLDB | 0.00022846068
454 | An Overview of Query Optimization in Relational Systems | 1998 | PODS | 0.00022734812
530 | Random Sampling for Histogram Construction: How much is enough? | 1998 | SIGMOD | 0.00020803682
549 | Tracking Join and Self-Join Sizes in Limited Storage | 1999 | PODS | 0.00020376603
553 | Bifocal Sampling for Skew-Resistant Join Size Estimation | 1996 | SIGMOD | 0.00020272061
593 | Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies | 1996 | VLDB | 0.00019536993
684 | Towards a Robust Query Optimizer: A Principled and Practical Approach | 2005 | SIGMOD | 0.00018179769
703 | Query Execution Techniques for Caching Expensive Methods | 1996 | SIGMOD | 0.00017916705
852 | Dynamic Multidimensional Histograms | 2002 | SIGMOD | 0.00015941524
1,241 | Multi-dimensional Selectivity Estimation Using Compressed Histogram Information | 1999 | SIGMOD | 0.00013097578
1,443 | Compressing SQL Workloads | 2002 | SIGMOD | 0.00011947004
1,455 | RainForest - A Framework for Fast Decision Tree Construction of Large Datasets | 1998 | VLDB | 0.00011899821
1,683 | Cardinality Estimation: An Experimental Survey | 2018 | VLDB | 0.00010922679
1,797 | Effective Use of Block-Level Sampling in Statistics Estimation | 2004 | SIGMOD | 0.00010523169
2,048 | Graph Cube: On Warehousing and OLAP Multidimensional Networks | 2011 | SIGMOD | 9.6914395e-05
2,053 | Selectivity Estimation in Spatial Databases | 1999 | SIGMOD | 9.6728745e-05
2,184 | A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data | 2014 | SIGMOD | 9.3429789e-05
2,841 | Selectivity Estimation in Extensible Databases - A Neural Network Approach | 1998 | VLDB | 8.0287389e-05
3,013 | Cardinality Estimation Using Sample Views with Quality Assurance | 2007 | SIGMOD | 7.7137441e-05
3,050 | Comparing Data Streams Using Hamming Norms (How to Zero In) | 2002 | VLDB | 7.6512619e-05
3,102 | Processing Set Expressions over Continuous Update Streams | 2003 | SIGMOD | 7.5586568e-05
3,167 | Relational Confidence Bounds Are Easy With The Bootstrap* | 2005 | SIGMOD | 7.4523397e-05
3,330 | Adapting to Source Properties in Processing Data Integration Queries | 2004 | SIGMOD | 7.2150831e-05
3,558 | Approximate Selection with Guarantees using Proxies | 2020 | VLDB | 6.9765724e-05
3,702 | Every Row Counts: Combining Sketches and Sampling for Accurate Group-By Result Estimates | 2019 | CIDR | 6.8295759e-05
3,824 | Correlation Sketches for Approximate Join-Correlation Queries | 2021 | SIGMOD | 6.7260705e-05
3,842 | Turbo-Charging Estimate Convergence in DBO | 2009 | VLDB | 6.7102374e-05
4,031 | Approximate Quantiles and the Order of the Stream | 2006 | PODS | 6.5121141e-05
4,177 | Density Biased Sampling: An Improved Method for Data Mining and Clustering | 2000 | SIGMOD | 6.3835403e-05
4,185 | Arnold: Declarative Crowd-Machine Data Integration | 2013 | CIDR | 6.3776356e-05
4,833 | MNC: Structure-Exploiting Sparsity Estimation for Matrix Expressions | 2019 | SIGMOD | 5.8916346e-05
5,117 | Sampling Algorithms in a Stream Operator | 2005 | SIGMOD | 5.6825418e-05
5,340 | Efficiently Approximating Query Optimizer Plan Diagrams | 2008 | VLDB | 5.5623066e-05
5,736 | Efficient Computation of Multiple Group By Queries | 2005 | SIGMOD | 5.3482537e-05
5,982 | Modeling skewed distributions using multifractals and the '80-20 law' | 1996 | VLDB | 5.2446136e-05
6,278 | Uncertainty Aware Query Execution Time Prediction | 2014 | VLDB | 5.1309442e-05
6,941 | Estimating the Impact of Unknown Unknowns on Aggregate Query Results | 2016 | SIGMOD | 4.8924e-05
7,415 | Efficient and Scalable Statistics Gathering for Large Databases in Oracle 11g | 2008 | SIGMOD | 4.7355557e-05
7,467 | Yannakakis+: Practical Acyclic Query Evaluation with Theoretical Guarantees | 2025 | SIGMOD | 4.7218691e-05
7,603 | Automated design of multidimensional clustering tables for relational databases | 2004 | VLDB | 4.6985903e-05
7,610 | Learning to be a Statistician: Learned Estimator for Number of Distinct Values | 2022 | VLDB | 4.6965039e-05

Outgoing Citations (Sorted by Pagerank)

Showing 4 of 4 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.


Semantically Similar Papers