K-anonymity, l-diversity and t-closeness | Data Privacy Handbook (2024)

On this page: k-anonymous, l-diverse, t-close, privacy model, quantifying privacy, key attribute, sensitive attribute, quasi-identifier
Date of last review: 2023-05-30

K-anonymity, L-diversity and T-closeness are statistical approaches that quantify the level of identifiability within a tabular dataset, especially when variables within that dataset are combined. They are complementary approaches: a dataset can be k-anonymous, L-diverse and T-close, where k, L and T all represent a number.

Identifiers, quasi-identifiers, and sensitive attributes

Privacy models like k-anonymity, L-diversity and T-closeness distinguish between 3 types of variables in a dataset:

Identifiers (also known as key attributes): direct identifiers such as names, student numbers, email addresses, etc. These variables should in principle not be collected at all, or should be removed from the dataset if they are not necessary for your research project.
Quasi-identifiers: indirect identifiers that can lead to identification when combined with other quasi-identifiers in the dataset or with external information. These are often demographic variables like age, sex, place of residence, etc., but could also be something entirely different like physical characteristics, timestamps, etc. In general, quasi-identifiers are variables that are likely to be known to someone in the outside world.
Sensitive attributes: variables of interest which should be protected, and which cannot be changed, because they are the main outcome variables. Examples are Medical condition in a healthcare dataset, or Income in a financial dataset.

It is important to correctly assign each variable in your dataset to one of these variable types if you want to apply k-anonymity, l-diversity and t-closeness, because the types determine how the dataset will be de-identified.

How it works

K-anonymity

K-anonymity ensures that each individual in a dataset cannot be distinguished from at least k-1 other individuals with respect to the quasi-identifiers in the dataset. This is done through generalisation, suppression and sometimes top- and bottom-coding. Applying k-anonymity makes it more difficult for an attacker to re-identify specific individuals in the dataset. It protects against singling out and, to some extent, the Mosaic effect.

Original dataset
Nr	Age	Sex	City	Disease
1	16	Male	Rotterdam	Viral infection
2	18	Male	Rotterdam	Heart-related
3	19	Male	Rotterdam	Cancer
4	22	Female	Rotterdam	Viral infection
5	22	Male	Zwolle	No illness
6	23	Male	Zwolle	Tuberculosis
7	24	Male	Zwolle	Heart-related
8	25	Female	Utrecht	Cancer
9	26	Female	Rotterdam	Heart-related
10	28	Female	Utrecht	Tuberculosis

2-anonymous dataset
Nr	Age	Sex	City	Disease
1	=< 20	Male	Rotterdam	Viral infection
2	=< 20	Male	Rotterdam	Heart-related
3	=< 20	Male	Rotterdam	Cancer
4	20-30	Female	Rotterdam	Viral infection
5	20-30	Male	Zwolle	No illness
6	20-30	Male	Zwolle	Tuberculosis
7	20-30	Male	Zwolle	Heart-related
8	20-30	Female	Utrecht	Cancer
9	20-30	Female	Rotterdam	Heart-related
10	20-30	Female	Utrecht	Tuberculosis
Colours indicate an ‘equivalence class’ of quasi-identifiers

To make a dataset k-anonymous, you must first identify which variables in the dataset are identifiers, quasi-identifiers, and sensitive attributes. In the example above, Age, Sex and City are quasi-identifiers and Disease is the sensitive attribute. Next, you should set a value for k. If we choose a k of 2, every row in the example dataset should have the same combination of Age, Sex and City as at least 1 other row in the dataset. Finally, you aggregate the dataset so that every combination of quasi-identifiers occurs at least k times. In the example, this was done by generalising Age into age categories, but there may also be other ways to reach 2-anonymity in this dataset.
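As a rough illustration of these steps, the sketch below measures k for the example dataset before and after generalising Age. It uses only the Python standard library; the function names `generalise_age` and `k_of` and the cut-off at 20 are assumptions made for this example, not part of any particular anonymisation tool.

```python
from collections import Counter

# Rows of the original example table: (Age, Sex, City, Disease)
rows = [
    (16, "Male", "Rotterdam", "Viral infection"),
    (18, "Male", "Rotterdam", "Heart-related"),
    (19, "Male", "Rotterdam", "Cancer"),
    (22, "Female", "Rotterdam", "Viral infection"),
    (22, "Male", "Zwolle", "No illness"),
    (23, "Male", "Zwolle", "Tuberculosis"),
    (24, "Male", "Zwolle", "Heart-related"),
    (25, "Female", "Utrecht", "Cancer"),
    (26, "Female", "Rotterdam", "Heart-related"),
    (28, "Female", "Utrecht", "Tuberculosis"),
]

# Column positions of the quasi-identifiers: Age, Sex, City
QUASI_IDENTIFIERS = (0, 1, 2)


def generalise_age(age):
    """Hypothetical generalisation rule mirroring the example table.

    Only valid for this example, where no age exceeds 30.
    """
    return "=< 20" if age <= 20 else "20-30"


def k_of(records, quasi_identifiers):
    """Return k: the size of the smallest equivalence class."""
    classes = Counter(tuple(r[i] for i in quasi_identifiers) for r in records)
    return min(classes.values())


generalised = [(generalise_age(a), s, c, d) for a, s, c, d in rows]

k_before = k_of(rows, QUASI_IDENTIFIERS)
k_after = k_of(generalised, QUASI_IDENTIFIERS)
print(k_before, k_after)  # 1 2
```

Every row in the original table is unique on Age, Sex and City, so k is 1. After generalising Age, the smallest group of identical quasi-identifier combinations has 2 rows, so the table is 2-anonymous. Passing a different set of column positions to `k_of` shows how the result depends on which variables you treat as quasi-identifiers.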

There is no single value for k which you should always choose. The higher the k, the more difficult it will be to identify someone, but your dataset will likely also become less granular and perhaps less informative. The value of k will be highly dependent on what you communicated to data subjects (e.g., you may have promised a certain k) and the risk of identification that you are willing to accept.

The video below gives an example of how k-anonymity can work in practice:

L-diversity

L-diversity is an extension to k-anonymity that ensures that there is sufficient variation in a sensitive attribute. This is important, because if all individuals in a (subset of a) dataset have the same value for the sensitive attribute, there is still a risk of inference. For example, in the 2-anonymous dataset below, you can infer that any female from Rotterdam between 20 and 30 who participated had a viral infection (“homogeneity attack”). Similarly, if you know that your 25-year-old female neighbour from Utrecht participated in this study, you learn that she suffers from cancer (“background knowledge attack”).

2-anonymous dataset
Nr	Age	Sex	City	Disease
1	=< 20	Male	Rotterdam	Viral infection
2	=< 20	Male	Rotterdam	Heart-related
3	=< 20	Male	Rotterdam	Cancer
4	20-30	Female	Rotterdam	Viral infection
5	20-30	Male	Zwolle	No illness
6	20-30	Male	Zwolle	Tuberculosis
7	20-30	Male	Zwolle	Heart-related
8	20-30	Female	Utrecht	Cancer
9	20-30	Female	Rotterdam	Viral infection
10	20-30	Female	Utrecht	Cancer
Colours indicate an ‘equivalence class’ of quasi-identifiers

2-anonymous 2-diverse dataset
Nr	Age	Sex	City	Disease
1	=< 20	Male	Rotterdam	Viral infection
2	=< 20	Male	Rotterdam	Heart-related
3	=< 20	Male	Rotterdam	Cancer
4	20-30	Female		Viral infection
5	20-30	Male	Zwolle	No illness
6	20-30	Male	Zwolle	Tuberculosis
7	20-30	Male	Zwolle	Heart-related
8	20-30	Female		Cancer
9	20-30	Female		Viral infection
10	20-30	Female		Cancer
Colours indicate an ‘equivalence class’ of quasi-identifiers and sensitive attributes

K-anonymity does not protect against such homogeneity and background knowledge attacks. Therefore, L-diversity proposes that there should be at least L different values for the sensitive attribute per combination of quasi-identifiers. In the example above, if we choose an L of 2, that means that for each combination of Age, Sex and City, there are at least 2 distinct diseases. In the example, we suppressed City for the homogeneous cases, so that any female between 20 and 30 years old can have either cancer or a viral infection.
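The check described above can be sketched in a few lines of standard-library Python. The helper names `l_of` and `suppress_city` are hypothetical, and the suppression rule is hard-coded to reproduce the example tables rather than to serve as a general method.

```python
from collections import defaultdict

# Rows of the 2-anonymous example table: (Age, Sex, City, Disease)
rows = [
    ("=< 20", "Male", "Rotterdam", "Viral infection"),
    ("=< 20", "Male", "Rotterdam", "Heart-related"),
    ("=< 20", "Male", "Rotterdam", "Cancer"),
    ("20-30", "Female", "Rotterdam", "Viral infection"),
    ("20-30", "Male", "Zwolle", "No illness"),
    ("20-30", "Male", "Zwolle", "Tuberculosis"),
    ("20-30", "Male", "Zwolle", "Heart-related"),
    ("20-30", "Female", "Utrecht", "Cancer"),
    ("20-30", "Female", "Rotterdam", "Viral infection"),
    ("20-30", "Female", "Utrecht", "Cancer"),
]

QUASI_IDENTIFIERS = (0, 1, 2)  # Age, Sex, City
SENSITIVE = 3                  # Disease


def l_of(records, quasi_identifiers, sensitive):
    """Return L: the smallest number of distinct sensitive values in any
    equivalence class."""
    values = defaultdict(set)
    for r in records:
        key = tuple(r[i] for i in quasi_identifiers)
        values[key].add(r[sensitive])
    return min(len(v) for v in values.values())


def suppress_city(record):
    """Hypothetical rule from the example: blank out City for females aged 20-30."""
    age, sex, city, disease = record
    if sex == "Female" and age == "20-30":
        return (age, sex, None, disease)  # None marks a suppressed value
    return record


suppressed = [suppress_city(r) for r in rows]

l_before = l_of(rows, QUASI_IDENTIFIERS, SENSITIVE)
l_after = l_of(suppressed, QUASI_IDENTIFIERS, SENSITIVE)
print(l_before, l_after)  # 1 2
```

Before suppression, the two women from Rotterdam share a single disease value, so L is 1. After suppressing City, the four women aged 20-30 form one group with two distinct diseases, so the table is 2-diverse.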

As with k-anonymity, there is no perfect value of L, although it is usually less than or equal to k and more than 1.

The video below explains the concept of L-diversity using an example:

T-closeness

T-closeness ensures that the distribution of a sensitive attribute within each equivalence class (each group of rows that share the same generalised quasi-identifier values) is close to the distribution of the sensitive attribute in the entire dataset. In other words, it ensures that the sensitive attribute is not skewed towards a specific value within a group of similar individuals, which could potentially be used to infer sensitive information about someone. For example, if a dataset contains information on Age (quasi-identifier), Sex (quasi-identifier), and Income (sensitive attribute), and t-closeness is applied with a value of t = 0.1, then for each combination of Age and Sex, the distance between the distribution of income in that group and the distribution of income in the entire dataset must not exceed 0.1.
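To make the idea of "distance between distributions" more concrete, the sketch below computes the largest distance between any equivalence class and the whole dataset, using the 2-anonymous, 2-diverse disease table from the L-diversity section. Two assumptions apply: the text above does not name a distance measure, so total variation distance (half the sum of absolute differences in proportions) is used here as a simple stand-in for a categorical attribute, and the helper names are hypothetical. T-closeness is commonly described in terms of the Earth Mover's Distance, which may give different values, particularly for numeric attributes such as Income.

```python
from collections import Counter, defaultdict

# Rows of the 2-anonymous, 2-diverse example table: (Age, Sex, City, Disease)
# None marks a suppressed City value.
rows = [
    ("=< 20", "Male", "Rotterdam", "Viral infection"),
    ("=< 20", "Male", "Rotterdam", "Heart-related"),
    ("=< 20", "Male", "Rotterdam", "Cancer"),
    ("20-30", "Female", None, "Viral infection"),
    ("20-30", "Male", "Zwolle", "No illness"),
    ("20-30", "Male", "Zwolle", "Tuberculosis"),
    ("20-30", "Male", "Zwolle", "Heart-related"),
    ("20-30", "Female", None, "Cancer"),
    ("20-30", "Female", None, "Viral infection"),
    ("20-30", "Female", None, "Cancer"),
]

QUASI_IDENTIFIERS = (0, 1, 2)  # Age, Sex, City
SENSITIVE = 3                  # Disease


def distribution(values):
    """Return the proportion of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}


def total_variation(p, q):
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def t_of(records, quasi_identifiers, sensitive):
    """Return the largest distance between any equivalence class and the
    whole dataset (assumed distance: total variation)."""
    overall = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        key = tuple(r[i] for i in quasi_identifiers)
        groups[key].append(r[sensitive])
    return max(total_variation(distribution(v), overall) for v in groups.values())


t = t_of(rows, QUASI_IDENTIFIERS, SENSITIVE)
print(round(t, 2))  # 0.6
```

Under this assumed distance, the example table only satisfies t-closeness for t of about 0.6 or higher: the group of men from Zwolle contains diseases that are rare in, or absent from, the rest of the dataset. A table can therefore be 2-anonymous and 2-diverse and still be far from a strict threshold such as t = 0.1.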

T-closeness can get complicated quite fast. If you’re curious to know how it works, the video below explains the concept of t-closeness using an example:

When to use

K-anonymity, L-diversity and t-closeness are usually applied to de-identify tabular datasets before they are shared. They are also most suitable for relatively large datasets (i.e., containing a large number of individuals), as more details (utility) are likely to be retained in such datasets (source).

Implications for research

It is very easy to lose a lot of the (granularity of the) data when satisfying the k-, L- or T-criteria: the higher the criteria, the lower the risk of re-identification, but the more information you lose. The balance between privacy and utility is therefore very important to take into consideration when applying these privacy models.
The more variables (quasi-identifiers), the larger the dataset and the more outliers there are in the dataset, the more difficult de-identification will be without losing too much information (as shown here).
If a dataset is k-anonymous, L-diverse or T-close, that does not mean that the dataset is also considered anonymous under the GDPR. The degree of anonymity after applying these approaches depends entirely on your own choices in terms of k, L or T, in terms of the variables that you included, and on the context of your dataset. For example, if you failed to include a quasi-identifier in k-anonymising your dataset, your dataset is in reality not k-anonymous.