Cardinality and selectivity can provide measurable insights into your data set and effectiveness of the index, pointing you to where specific optimizations can be made.

Cardinality and selectivity play a crucial role in index tuning and optimization as they can provide measurable insights into your data set and effectiveness of the index, pointing you to where specific optimizations can be made.

Cardinality refers to the individual uniqueness of values in a specific index key. Each index key (document property) emitted into the index will have varying degrees of cardinality. Cardinality can be broken down into roughly 3 different types:

*High-Cardinality*: Refers to values that are unique or very uncommon within the index key. Examples include fields such as GUIDs, IDs, email addresses, and usernames.*Normal-Cardinality*: Refers to values that are somewhat uncommon but not necessarily unique within the index key. Examples include:**first**/**middle**/**last name**,**zip codes**. There are last names / surnames that very well may be unique in the data set, however, if you were to examine all of the distinct values you'll find groupings of certain values (i.e. Jones).*Low-Cardinality*: Refers to values that are common within the data set and have very few possible values. Examples include status, gender, and booleans. Fields that have little uniqueness and are common across the index, examples are status, gender, and booleans.

Selectivity is the measure of variation in unique values in a given data set and it is represented as a number between `0 - 1`

or `0 - 100%`

. The formula to calculate selectivity can be represented as follows:

`selectivity = cardinality/(number of records) * 100`

or more simply stated:

`Number of Distinct Values / Total number of Records = Selectivity`

Cardinality and Selectivity can be applied to any "data set" such as an index, query or bucket. In general for database indexes, the higher cardinality -> better selectivity -> faster scans -> increased performance. Consider the table below:

Name | Breed | Gender | Origin Country | |
---|---|---|---|---|

1 | Oakley | German Shepherd | M | Germany |

2 | Zeus | Doberman Pinscher | M | Germany |

3 | Darby | Doberman Pinscher | F | Germany |

4 | Rocky | Bulldog | M | United Kingdom |

5 | Lucy | Labrador Retriever | F | Canada |

6 | Buddy | Golden Retriever | M | United Kingdom |

7 | Molly | Pug | F | China |

8 | Sadie | Labrador Retriever | F | Canada |

9 | Max | Boxer | M | Germany |

10 | Simba | Great Dane | M | Germany |

Cardinality | 10 | 6 | 2 | 4 |

Selectivity | 100% | 60% | 20% | 40% |

Using the `travel-sample`

bucket, we'll calculate two selectivity values for some of the sample indexes:

*Projection Selectivity*: This is a measure of the # of documents in the bucket that match the index filter/`WHERE`

predicate and contain the leading field. This is often referred to as "index segmentation".*Index Selectivity*: This is a measure of the number of unique values in the index compared to the total # of entries in the index.

For optimum performance, you will want a relatively low percentage of Projection Selectivity as this means the index is smaller, and a higher value for Index selectivity as this means there is a lot of uniqueness within the index.

Initially, we need to get the total # of documents in the bucket, as we will reuse this value in all of our calculations:

```
SELECT RAW COUNT(1)
FROM `travel-sample`
```

`[31591]`

`def_type`

index`CREATE INDEX `def_type` ON `travel-sample`(`type`)`

Determine the total number of records in the index, this query will push the `COUNT()`

down to the indexer, and we trigger the use of the index by referencing the first field in the index. If needed you could optionally specify a `USE INDEX()`

statement to ensure the index is used:

```
SELECT COUNT(1)
FROM `travel-sample`
WHERE type IS NOT MISSING
```

`[31591]`

Next, we need to determine the total number of possible unique values in the index:

```
SELECT RAW COUNT(DISTINCT type)
FROM `travel-sample`
WHERE type IS NOT MISSING
```

`[5]`

Description | Formula | Selectivity |
---|---|---|

Projection Selectivity | (31591 / 31591) * 100 | 100% |

Index Selectivity | (5 / 31591) * 100 | 0.015% |

`def_faa`

index`CREATE INDEX `def_faa` ON `travel-sample`(`faa`)`

Determine the total number of records in the index:

```
SELECT RAW COUNT(1)
FROM `travel-sample`
WHERE faa IS NOT MISSING
```

`[1968]`

Next, we need to determine the total number of possible unique values in the index:

```
SELECT RAW COUNT(DISTINCT faa)
FROM `travel-sample`
WHERE faa IS NOT MISSING
```

`[1708]`

Description | Formula | Selectivity |
---|---|---|

Projection Selectivity | (1968 / 31591) * 100 | 6.23% |

Index Selectivity | (1708 / 1968) * 100 | 86.79% |

`def_country`

index`CREATE INDEX `def_country` ON `travel-sample`(`country`, `type`)`

Determine the total number of records in the index:

```
SELECT RAW COUNT(1)
FROM `travel-sample`
WHERE country IS NOT MISSING
```

`[7567]`

Next, we need to determine the total number of possible unique values in the index. For this example, however, there are two index keys `country`

and `type`

. The selectivity depends on how the index will be used and when optimizing it is important to understand how the cardinality of one key can affect the other.

```
SELECT *
FROM `travel-sample`
WHERE country = 'United States'
```

```
SELECT RAW COUNT(DISTINCT country)
FROM `travel-sample`
WHERE country IS NOT MISSING
```

`[3]`

```
SELECT *
FROM `travel-sample`
WHERE country = 'United States' AND type = 'landmark'
```

When both keys are used, the selectivity can be described in two ways, the first is the total uniqueness of both keys when combined together:

```
SELECT RAW COUNT(DISTINCT country || type)
FROM `travel-sample`
WHERE country IS NOT MISSING
```

`[12]`

The second is # of unique keys for the second index key which matches the previous index key:

```
SELECT RAW COUNT(1)
FROM `travel-sample`
WHERE country = 'United States'
```

`[3948]`

```
SELECT RAW COUNT(DISTINCT type)
FROM `travel-sample`
WHERE country = 'United States'
```

`[4]`

Description | Formula | Selectivity |
---|---|---|

Projection Selectivity | (7567 / 31591) * 100 | 23.95% |

Index Selectivity (3.a) | (3 / 7567) * 100 | 0.039% |

Index Selectivity (3.b) | (12 / 7567) * 100 | 0.16% |

Index Selectivity (3.c) | (4 / 3948) * 100 | 0.10% |

Both cardinality and selectivity can affect the performance of IndexScans, and you should always consider their implications as it relates to your access patterns and query predicates. Having a solid understanding of cardinality and selectivity as it relates to your data set can provide solid guidance in the tuning and determining the order of index keys within the index.

This tutorial is part of a
Couchbase Learning Path:

Was this page helpful?