What Is Personal Data? Differences Between Pseudonymization, Anonymization, and Data Encryption

2020-05-21 Penta Security Blog

Despite the silent yet aggressive spread of COVID-19, most countries have managed to flatten the curve of infections, and some have brought it down significantly. Still, this is not the time to relax: we are far from containing the virus, and the future remains uncertain.

The point here is that our interventions worked. According to an early multivariate predictive model developed by Imperial College London, had there been no intervention, the disease would have infected 7 billion people around the world, nearly all of humanity, and killed 40 million in 2020 alone. Fortunately, we are nowhere close to those numbers.

A crucial part of our intervention measures involved data. Data were utilized for predictive modeling, virus and disease research, disease tracking, and epidemiological surveillance.

We cannot speak of data without sounding the privacy alarm. However, this pandemic has shown us that we can no longer resist data sharing as it becomes an integral process of our society. If we cannot avoid data sharing, then let us face it by protecting our data. Whether you are a business or a consumer, the more you know, the better you can protect yourself.

What is considered personal data?

There is no universal consensus on what constitutes “personal”. The general rule of thumb is whether the data can be used to identify a specific individual. For a more precise description, let us take a look at how “personal data” is defined in two major data protection regulations: the EU-based General Data Protection Regulation (GDPR), and the US-based California Consumer Privacy Act (CCPA).

Under GDPR, “personal data” is defined as “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

Under CCPA, “personal information” is defined as “information that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.”

Note that the keyword here is “identifiability”. Personal data includes what are called “identifiers”, which can be either direct or indirect. Common direct identifiers include names, home addresses, phone numbers, ID numbers, email addresses, and sometimes even IP addresses. Indirect identifiers do not name a person on their own, but can be linked to direct identifiers. These can include internet browsing histories, location data, employment-related information, or education information. The line between an indirect identifier and a non-identifier can be quite ambiguous, and an indirect identifier is often neglected and treated as a non-identifier.

In order to protect personal data and privacy, different methods can be applied in the recording and storing process. The data owner can either destroy all personal identifiers in a process called anonymization, or substitute all identifiers with non-identifiers in a process called pseudonymization. Finally, whether pseudonymized or not, data encryption should always be a final step to data protection.

Data pseudonymization

Pseudonymized data is still personal data. Pseudonymization is the practice of substituting a personal identifier with a non-identifier, so that the data subject becomes unidentifiable unless additional information is provided. For instance, instead of the real name of the data subject (i.e. the identifier), a randomly assigned reference number (i.e. a non-identifier) would be recorded. It is important to note that in data pseudonymization, each identifier is always paired with the same non-identifier, so all records about one person carry the same reference number.

To take a concrete example, let’s say a taxi company wants to track the performance of its fleet, and assigns a reference number to each driver. This allows the analytics team to study driving behavior and fuel consumption on a per-person basis. However, the analytics team would not be able to know who the actual driver is just by looking at the reference number. Thus the data here are considered to be pseudonymized.

Nevertheless, as this example shows, pseudonymization is reversible. By combining the data that links the drivers to their reference numbers with the data that links the reference numbers to driving behavior, the driving behavior of each data subject would be exposed (i.e. if A = B and B = C, then A = C). Hence, it is crucial to keep the two datasets separate in order to prevent the data subject from being identified.
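The taxi example can be sketched in a few lines of Python. This is only an illustration under assumed names: the record fields (`driver`, `fuel_l_per_100km`), the sample drivers, and the functions `pseudonymize` and `reidentify` are all hypothetical and not part of any particular product. The sketch shows that each driver always gets the same reference number, and that anyone holding both datasets can reverse the substitution.

```python
import secrets


def pseudonymize(trips):
    """Replace driver names with random reference numbers.

    Returns (pseudonymized_trips, lookup). The lookup maps each reference
    number back to a real name and should be stored separately.
    """
    name_to_ref = {}
    pseudonymized = []
    for trip in trips:
        name = trip["driver"]
        if name not in name_to_ref:
            # Draw a random reference number; retry on the rare collision.
            ref = secrets.randbelow(10_000)
            while ref in name_to_ref.values():
                ref = secrets.randbelow(10_000)
            name_to_ref[name] = ref
        # Copy every field except the name, then attach the reference number.
        record = {k: v for k, v in trip.items() if k != "driver"}
        record["driver_ref"] = name_to_ref[name]
        pseudonymized.append(record)
    lookup = {ref: name for name, ref in name_to_ref.items()}
    return pseudonymized, lookup


def reidentify(pseudonymized, lookup):
    """Join the two datasets again, which undoes the pseudonymization."""
    restored = []
    for record in pseudonymized:
        original = {k: v for k, v in record.items() if k != "driver_ref"}
        original["driver"] = lookup[record["driver_ref"]]
        restored.append(original)
    return restored


# Hypothetical sample data.
trips = [
    {"driver": "Alice", "fuel_l_per_100km": 8.4},
    {"driver": "Bob", "fuel_l_per_100km": 9.7},
    {"driver": "Alice", "fuel_l_per_100km": 8.9},
]
pseudo_trips, lookup = pseudonymize(trips)
```

The analytics team would receive only `pseudo_trips`. Calling `reidentify(pseudo_trips, lookup)` returns the original records, which is why the lookup has to be kept apart from the pseudonymized data.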

Data anonymization

Anonymized data is no longer considered personal data. Anonymization is the highest level of personal data protection. (Or should we rather say personal data destruction?) It irreversibly destroys all identifiers linked to the data subject.

Returning to the taxi company example: after randomly assigning a reference number to each driver, the company completely deletes all the drivers’ names from the dataset, along with any record that links names to numbers, so that there is no possible way of rematching the numbers to the names. In this case, the data is said to be anonymized.

Anonymized data should still be treated with care. Remember how we mentioned that data that look like non-identifiers may still act as indirect identifiers? Indeed, the irreversible reference numbers are certainly safe. But the problem is that people behave in similar patterns, and what seem to be non-identifiers can still somehow reveal the driver’s real identity.

Let’s look at the following anonymized longitudinal (i.e. time-series) data.

Every day, driver no. 263 drives to the airport to pick up passengers at 9 a.m., heads to downtown at 10 a.m., stops for lunch at the same restaurant, fills up gas at the same gas station, and has an average daily fuel consumption in the 8 to 9 liters per 100 km range.

The above data record is completely anonymous, with no identifiers. Yet since everyone has their own behavioral pattern, such behavioral data can still be used to trace the real identity of the data subject. Of course, these behavior patterns can only serve as a basis for speculation, and carry no legal weight in identifying an individual.
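To make the risk concrete, here is a rough Python sketch of how someone who already knows a few habits of a colleague could narrow down the anonymized records. The reference numbers other than 263, the habit strings, and the function `matching_refs` are hypothetical, chosen only for illustration.

```python
def matching_refs(records, known_habits):
    """Return the reference numbers whose records contain every known habit."""
    return [
        ref
        for ref, habits in records.items()
        if known_habits <= habits  # subset test: all known habits are present
    ]


# Hypothetical anonymized longitudinal records: reference number -> habits.
anonymized = {
    263: {"airport 09:00", "downtown 10:00", "lunch: same restaurant"},
    117: {"airport 09:00", "suburbs 10:00", "lunch: same restaurant"},
    305: {"station 09:00", "downtown 10:00", "lunch: at home"},
}

# What a co-worker might already know about one particular driver.
known = {"airport 09:00", "downtown 10:00"}
candidates = matching_refs(anonymized, known)
```

With these sample values, only reference number 263 matches both known habits, so the rest of that driver's record becomes attributable to a specific person even though no identifier was ever stored.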

Nevertheless, when storing anonymized data, recording latitudinal data (that is, cross-sectional data aggregated across individuals) is much safer than recording longitudinal data. Take the following latitudinal data for example.

35 drivers drive to the airport to pick up passengers at 9 a.m.; 58 drivers are in the downtown area at 10 a.m.; 8 drivers eat lunch at the Subway restaurant on Queen Street; 22 drivers fill up gas at the Esso gas station on Wilson Avenue; 49 drivers have a daily fuel consumption in the range of 9 to 10 liters per 100 km.

Note the difference? Both datasets contain the same information. But recording them longitudinally reveals individual behavior patterns, while recording them latitudinally would not show any individual pattern.
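The conversion from per-driver records to counts can be sketched as follows. This is a minimal illustration, not a complete anonymization procedure: the sample records and the function `aggregate` are hypothetical.

```python
from collections import Counter


def aggregate(per_driver_events):
    """Count how many drivers did each activity, dropping the driver reference."""
    counts = Counter()
    for events in per_driver_events.values():
        # Use a set so each driver is counted at most once per activity.
        counts.update(set(events))
    return dict(counts)


# Hypothetical longitudinal records: reference number -> activities.
longitudinal = {
    263: ["airport 09:00", "downtown 10:00"],
    117: ["airport 09:00"],
    305: ["downtown 10:00"],
}
summary = aggregate(longitudinal)
```

Here `summary` holds only totals per activity, so the link between one driver's morning and the rest of that driver's day is gone. Very small counts can still hint at an individual, so aggregated figures deserve some care as well.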

The above mistake is commonly made by governments. During this COVID-19 outbreak, some governments disclosed anonymized longitudinal data on the activities of confirmed patients prior to their diagnosis. These governments defend their actions by arguing that the released data are completely anonymous. However, these activities can still be recognized by people around the data subject, such as friends and co-workers, which would likely lead to the spread of speculation and rumors that are damaging to the data subject.

Data encryption

Most businesses store consumer data for commercial uses. Under such circumstances, data anonymization is not a feasible solution. To maintain customer service, identifiers like customer names, email addresses, phone numbers, and purchase history simply cannot be destroyed.

In most commercial environments, data pseudonymization is also not applicable. This is because in order to provide prompt customer service, personal data must be readily available and interpretable to all customer service employees. Moreover, even though data protection regulations like GDPR encourage pseudonymization, pseudonymized data remain personal data because the process is reversible: they are not exempt from most requirements, and violations involving them are still subject to penalties.

Whether or not data pseudonymization is applicable to your business, data encryption should always be applied, especially for large consumer databases. Data regulations do not specify which encryption frameworks and algorithms to use, but the way the encryption keys are managed is commonly examined to determine whether a practice is appropriate. With server-side encryption, the key is held on the server side and is therefore within reach of the server administrator, while with end-to-end encryption only the end users hold the key.

There is no universal encryption framework that works for all. Every business should adopt an algorithm that best suits its situation. This is why Penta Security’s D’Amo database security solution does not commit to any single encryption technology, and instead starts from a thorough analysis of each customer’s system architecture to provide an optimized framework that offers the highest level of performance and security.
