Data Anonymization

Data anonymization protects users' confidential information by altering or encoding the identifiers that link individuals to stored information.

Data Science | 02/24/2021 UTC
Blog Hashbrown

Data anonymization is the process by which personal data is irreversibly altered so that the data subject can no longer be identified, directly or indirectly, either by the data controller alone or in collaboration with any other party. Data anonymization may enable the transfer of information across a boundary, such as between two departments within an agency or between two agencies, while reducing the risk of unintended disclosure. For instance, in the case of medical data, anonymized data is data from which the recipient of the information cannot identify the patient.

Data that has been altered by anonymization techniques cannot be traced back to a specific individual, while its format and referential integrity are preserved during the process. Anonymization is one of several approaches organizations can use to comply with demanding data privacy laws that require the protection of Personally Identifiable Information (PII) such as contact information, health records, or financial details.

The ultimate goal of de-identification is to safeguard the confidentiality of the original data and ensure that the identity of a person cannot be inferred from the anonymized data. Once this is achieved, the anonymized data does not fall within the scope of the GDPR (General Data Protection Regulation), as it no longer counts as “personal data”.

The General Data Protection Regulation (GDPR) outlines a specific set of rules that protect user data and create transparency. As long as companies remove all identifiers from the data, GDPR allows them to collect anonymized data without consent, use it for any purpose, and store it for an indefinite time.

 

Anonymization has the following benefits:  

  1. Stronger information security, analogous to cybersecurity measures
  2. Lower risk when transferring information
  3. Possible reuse of information
  4. Application of automated Big Data techniques
  5. Cost savings from a reduced risk of regulatory fines

 

Data Anonymization Techniques:

1. Data Masking

Data masking hides data by replacing it with altered values. A mirror version of a database can be created, and modification techniques such as character shuffling, encryption, and word or character substitution can then be applied to it. For example, we can replace a character in a value with a symbol such as “*” or “x”. Done well, data masking makes reverse engineering or detection of the original values very difficult.
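As a minimal sketch of character substitution, the snippet below replaces all but the last few characters of a value with “*”. The function name mask_value and its parameters are hypothetical, chosen only for illustration.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace all but the last `visible` characters with `mask_char`."""
    if visible <= 0:
        # Nothing should stay readable: mask every character.
        return mask_char * len(value)
    # Number of leading characters to hide (never negative for short values).
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


# Keep only the last four digits of a (made-up) card number.
masked_card = mask_value("4111111111111111")

# Mask a name completely.
masked_name = mask_value("Mark Smith", visible=0)
```

Note that this simple version preserves the length of the original value, which can itself leak information; a fixed-length mask is an alternative if that matters for the dataset.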

 

2. Pseudonymization

Pseudonymization is a data management and de-identification method that replaces private identifiers with fake identifiers, or pseudonyms, for example replacing the identifier ‘Mark Smith’ with ‘Rahul Spencer’. While protecting data privacy, pseudonymization preserves statistical accuracy and data integrity, allowing the modified data to be used for training, development, testing, and analytics.
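One way to sketch this is a lookup table that assigns each real identifier a stable pseudonym, so that repeated occurrences of the same person still line up across records. The class name Pseudonymizer, the "person_0001" label format, and the sample records are assumptions made for this example.

```python
import itertools


class Pseudonymizer:
    """Map each real identifier to a stable, generated pseudonym."""

    def __init__(self, prefix: str = "person") -> None:
        self._prefix = prefix
        self._mapping: dict[str, str] = {}
        self._counter = itertools.count(1)

    def pseudonym_for(self, identifier: str) -> str:
        # Reuse the existing pseudonym so the same person always gets
        # the same replacement; otherwise generate the next one.
        if identifier not in self._mapping:
            self._mapping[identifier] = f"{self._prefix}_{next(self._counter):04d}"
        return self._mapping[identifier]


# Made-up sample records.
records = [
    {"name": "Mark Smith", "purchase": 120},
    {"name": "Jane Doe", "purchase": 75},
    {"name": "Mark Smith", "purchase": 30},
]

pseudonymizer = Pseudonymizer()
pseudonymized = [
    {**record, "name": pseudonymizer.pseudonym_for(record["name"])}
    for record in records
]
```

Because both ‘Mark Smith’ rows receive the same pseudonym, per-person totals and counts are unchanged. Anyone holding the mapping table can reverse the substitution, so in this sketch the table would need to be stored separately from the data and protected.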

 

3. Generalization 

This method deliberately removes some of the data to make it less identifiable. For example, the house number in an address can be removed while the road name is kept. The purpose is to eliminate some of the identifiers while retaining a measure of data accuracy.
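The address example can be sketched as follows. The helper names are hypothetical, and the age-bucketing function is an additional illustration of the same idea rather than something taken from the text above.

```python
import re


def generalize_address(address: str) -> str:
    """Drop a leading house number (e.g. '42' or '221B') but keep the road name."""
    return re.sub(r"^\s*\d+[A-Za-z]?\s+", "", address)


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with the range (bucket) that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


street_only = generalize_address("221B Baker Street")
age_range = generalize_age(37)
```

Here "221B Baker Street" becomes "Baker Street" and an age of 37 becomes "30-39". A wider bucket hides more but keeps less detail for analysis.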

 

4. Data Swapping (Shuffling & Permutation)

This technique rearranges the attribute values in a dataset so that they no longer correspond to the original records. It is typically applied to attributes (columns) that contain identifying values, such as date of birth.
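A minimal sketch of swapping a single column, assuming the dataset is a list of dictionaries; the function name swap_column and the sample data are made up for illustration.

```python
import random


def swap_column(records, column, seed=None):
    """Return copies of `records` with the values of `column` shuffled across rows."""
    rng = random.Random(seed)
    values = [record[column] for record in records]
    rng.shuffle(values)
    # Build new dictionaries so the original records are left untouched.
    return [{**record, column: value} for record, value in zip(records, values)]


# Made-up sample data.
people = [
    {"id": 1, "city": "Leeds", "date_of_birth": "1984-03-12"},
    {"id": 2, "city": "York", "date_of_birth": "1990-11-02"},
    {"id": 3, "city": "Bath", "date_of_birth": "1975-06-30"},
    {"id": 4, "city": "Hull", "date_of_birth": "2001-01-19"},
]

swapped = swap_column(people, "date_of_birth", seed=7)
```

The set of dates of birth is unchanged, so column-level statistics survive, but a given date no longer reliably belongs to the row it appears in. A random shuffle can leave some values in their original position, so a real implementation might need to check for that.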

 

5. Data Perturbation 

This technique modifies the original dataset slightly by rounding numbers and adding random noise. The size of the perturbation should be in proportion to the range of values: a small base may lead to weak anonymization, while a large base may reduce the usefulness of the dataset. For example, we can use a base of 5 for rounding values like age or house number because it is proportional to the original value. We could multiply a house number by 15 and the value may still retain its reliability. However, using a higher base like 15 can make age values seem fake.
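The effect of the base can be sketched with two small helpers, one for rounding to the nearest multiple of a base and one for adding bounded random noise. The function names, the sample ages, and the choice of uniform noise are assumptions for this example.

```python
import random


def round_to_base(value: float, base: int = 5) -> int:
    """Round `value` to the nearest multiple of `base`."""
    return base * round(value / base)


def add_noise(value: float, scale: float, rng: random.Random) -> float:
    """Add uniform random noise in the range [-scale, +scale]."""
    return value + rng.uniform(-scale, scale)


rng = random.Random(0)
ages = [23, 37, 38, 61]

rounded_ages = [round_to_base(age) for age in ages]      # base 5:  [25, 35, 40, 60]
coarse_ages = [round_to_base(age, 15) for age in ages]   # base 15: [30, 30, 45, 60]
noisy_ages = [add_noise(age, 2.0, rng) for age in ages]  # each within 2 years
```

With a base of 5 the ages stay close to the originals, while a base of 15 collapses 23 and 37 into the same value, which illustrates how a large base reduces the usefulness of the data.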

 

6. Synthetic Data

Synthetic data is algorithmically manufactured information that has no connection to real events. It can be used to create artificial datasets instead of altering the original dataset or using it as is and risking privacy and security. The process involves building statistical models based on patterns found in the original dataset. Statistics such as standard deviations and medians, linear regression, or other statistical techniques can be used to generate the synthetic data.
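As a deliberately simple sketch, the snippet below fits a mean and standard deviation to one numeric column and samples new values from a normal distribution with those parameters. The function name synthesize, the sample ages, and the assumption that the column is roughly normally distributed are all illustrative; realistic generators model several columns and their relationships together.

```python
import random
import statistics


def synthesize(values, n, seed=None):
    """Sample `n` synthetic values from a normal distribution fitted to `values`."""
    rng = random.Random(seed)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up original column.
original_ages = [23, 35, 41, 29, 52, 47, 38, 31]

synthetic_ages = synthesize(original_ages, n=1000, seed=42)
```

None of the synthetic values is tied to a real record, yet their average and spread should roughly follow the original column. Values outside a plausible range (for example, negative ages) are possible with this model and would need to be clipped or rejected.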
