Data Privacy & AI: When Your Smart Tech Gets a Little Too Smart
- Muxin Li
- Sep 17
- 12 min read
The Foundation: What Even IS Data Privacy?
Data privacy refers to the right of users to have control over how their information is collected, used, and shared. But here's the thing - data privacy doesn't just come from laws. There are actually several forces at play.

Market Forces: This relates to consumers deciding whether or not to let a company use their data. We vote with our wallets and our data.
Technology: Can provide better ways to ensure data privacy - you've probably heard a lot about Apple's work on keeping your data on your device, so your personal data stays on hardware you own rather than on their cloud servers.
Industry Self-Regulation: Some industries have gotten together and created their own agreements about data usage.
PII: The Stuff That Actually Matters
PII or Personally Identifiable Information is exactly what it sounds like. It's non-public personal information about somebody.
PII can be either directly identifiable or indirectly identifiable – as in, do I know this person's name and address outright, or can I infer who they are from their actions and other attributes?
Examples of directly identifiable: Name, phone number, street address - the obvious stuff.


Indirectly identifiable is trickier: Say you collect info that someone is a Polish citizen living in Munich, Germany, works in insurance, is between 50 and 54 years old, drives a BMW, and lives in a certain region of the city. Even though none of that directly says "this is John Smith," you could probably figure out who it is pretty easily.


Sensitive Information: Things like Social Security numbers and financial and medical records get extra protection under stricter privacy rules like HIPAA - it's why we always have to sign a form every time we go to the doctor's office.
The Good News: If it's impossible to figure out which users are which in aggregated data, or if it's been anonymized, then PII is not as much of a concern.
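To make the indirect-identification risk concrete, here's a minimal sketch (in Python, with invented column names and an arbitrary threshold) of a k-anonymity style check: if any combination of quasi-identifiers - citizenship, city, industry, age band - matches fewer than k people, someone in the data could still be singled out.

```python
# Hypothetical sketch of a k-anonymity style check over quasi-identifiers.
# The column names and the k threshold are invented for illustration.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k=5):
    """True if every combination of quasi-identifier values appears at least
    k times - meaning no individual can be singled out by those attributes."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

records = [
    {"citizenship": "PL", "city": "Munich", "industry": "insurance", "age_band": "50-54"},
    {"citizenship": "DE", "city": "Munich", "industry": "retail", "age_band": "30-34"},
    # ...more rows...
]
print(is_k_anonymous(records, ["citizenship", "city", "industry", "age_band"], k=5))
```

If a check like this fails, you'd typically generalize the quasi-identifiers (wider age bands, city instead of neighborhood) or drop them before treating the data as anonymized.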
The Legal Landscape: Where Your Users Are = What Laws You Follow
The data privacy laws that you have to follow depend on where your users are based. If your users are citizens or residents of a country, you're required to follow that country's laws.
In the United States, you also have to account for both federal and state level laws.
FIPS: The Pirate Code of Data Privacy
FIPS or Fair Information Practices - much like the Pirate's Code in Pirates of the Caribbean, these aren't exactly laws as much as guidelines. They form the foundation of many actual laws around the world – so chances are, understanding these principles will help you understand a large number of laws and stay compliant.
Basically a set of guidelines that you should have printed out, memorized, or stored somewhere you can easily reference them.

The Four Key Themes:
Rights of Individuals
Right to know what data is collected and for what purpose. Right to choose and give consent. Right to access their own data and review it for accuracy.
Controls on Information
Information security (reasonable technical and administrative safeguards). Information quality (maintain accurate and complete PII).
Information Lifecycle
Collect PII only as stated in privacy notification. Use PII only for purposes consistent with privacy policy. Retain PII only as long as necessary to fulfill the stated purpose.
Management of PII
Accountability and enforcement of privacy policies. Monitor compliance and have procedures to address complaints.
US Laws: A Patchwork Quilt of Regulation
As of when this course was created (2021), the United States did not have a national standard for data privacy - though of course that could have changed since.
The strictest state laws come from California in the form of the California Consumer Privacy Act or CCPA - chances are that how things are done in California will continue to be a leading indicator of what stricter laws around data privacy may look like.
The Big Three Industries: Healthcare, Education, Finance
Three industries dominate in terms of above-normal data privacy concerns – healthcare, education, and finance.
HIPAA - Healthcare Gets Serious
HIPAA (Health Insurance Portability and Accountability Act) - any entity that's offering health-related services (a covered entity) or does business with a covered entity needs to comply with HIPAA.
Covers Protected Health Information or PHI. Covered entities need to explain to people in detail how their data is being collected and how it's being used. Any other use of a patient's PHI requires authorization from them. Covered entities also have to designate a privacy official to make sure they're compliant.


FERPA - Students Get Protection Too
FERPA or the Family Educational Rights and Privacy Act gives students control over anything related to their education data, like grades, financial info, and disciplinary records. It covers pretty much all schools in the US - any school that receives federal funding must follow FERPA.


Schools are allowed to list directory information like a student's major and graduation year. They can also disclose information under certain conditions: if the disclosure is made to the student (when they're over 18) or to the parent (when they're under 18), if the student provides consent, or if the information has no way to personally identify a student. Students are also given the right to access their own information and review it for accuracy.
Financial Data - Multiple Laws in Play
There are at least a few key laws that govern financial data – the Fair Credit Reporting Act or FCRA, the Fair and Accurate Credit Transactions Act or FACTA, and the Gramm-Leach-Bliley Act or GLBA.
The financial industry has to disclose information in certain cases like money laundering.


FCRA applies when consumer credit data is used for situations like offering credit, insurance, background checks. The law requires organizations to limit the data to certain reporting purposes, and allow consumers to correct and access their data.
Organizations that use consumer credit information to make decisions have to disclose the reason why somebody may have been adversely impacted - like if you applied for a mortgage and you were rejected, they have to contact you and reference the fact that they found information in your consumer credit report.
GLBA implemented the Privacy Rule and the Safeguards Rule. The Privacy Rule created a standard for how organizations should notify users about the data they collect, and required that organizations let users opt out of having their data shared externally. The Safeguards Rule established standards for how that data must be protected - administrative, technical, and physical safeguards.
GDPR: The Big Scary European One
Have a global customer base? You're in luck - you also need to follow GDPR policies. GDPR applies if you have any kind of assets or employees in the EU, if you sell to any users based in the EU (even if you're offering your product for free), or if you store data in the EU.
Remember how we talked about directly identifiable and indirectly identifiable data? GDPR applies to both categories.

They made this one hurt – if you don't follow GDPR policies, you can be fined up to 4% of your worldwide annual revenue.
Data Controllers: The New Job GDPR Created
GDPR created a role called the data controller, whose job is exactly what you'd think – they collect and manage PII in an organization. Controllers are responsible for following seven key principles:
Transparency – controllers are required to inform users what data is being collected and for what purposes.
Purpose Limitation - organizations must limit what they do with the data that's been collected, using it only for the purposes that have been communicated and disclosed to their users, and for which users have provided express, explicit opt-in consent. This is probably why every new website you visit asks whether you'd like to enable optional data collection for advertising or personalization.
Data Minimization - Organizations should only collect the minimum amount of data necessary to fulfill the purposes they've communicated to their users.
Storage Limitation - organizations need to store the data that's been collected only for as long as needed to provide that service or fulfill that purpose (a rough sketch of enforcing minimization and retention follows this list).
Accuracy - Strive to ensure that the data collected is accurate
Data Security - obviously strive to maintain the security of that data. Yeah no concerns here… side eye at Equifax and other data breaches over the years.
Accountability - have an active program of monitoring to ensure compliance.
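Here's what enforcing data minimization and storage limitation might look like in practice - a hypothetical sketch where the field names, the allowed purpose, and the retention window are all invented for illustration, not taken from any particular law.

```python
# Hypothetical sketch of enforcing data minimization and storage limitation.
# Field names, the allowed purpose, and the retention window are illustrative only.
from datetime import datetime, timedelta, timezone

ALLOWED_FIELDS = {"email", "shipping_address"}   # declared for order fulfilment only
RETENTION = timedelta(days=365)                  # keep no longer than the purpose needs

def minimize(record):
    """Drop any field we never declared a purpose for."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

def is_expired(collected_at):
    """Flag records that have outlived their stated purpose."""
    return datetime.now(timezone.utc) - collected_at > RETENTION

record = {"email": "user@example.com", "phone": "555-0100", "shipping_address": "..."}
print(minimize(record))                                              # 'phone' is dropped
print(is_expired(datetime.now(timezone.utc) - timedelta(days=400)))  # True: delete it
```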
The 8 Rights GDPR Gives Individuals
Right to be informed - you have the right to know what data is being collected and why
Right of access - you should have the right to access any data that an organization has been collecting about you
Right to rectify - you have the right to rectify or correct any data that's been collected that you've deemed inaccurate. It's like rewriting history!
Right to be forgotten - the right to be forgotten or to require an organization to delete and erase any data that's been collected about you. Time to check that privacy setting in all your social media and online accounts in case you will forever be remembered as that person with the dog memes.
Right to restrict processing - Unfortunately not all data collected about you can be easily deleted – deleting it might violate other laws, or it may be technically impossible. In those cases, you have the right to restrict the processing of that data and prevent organizations from using it for certain types of purposes.
Right to object - the right to object to organizations using your data for certain purposes, for instance, preventing organizations from sharing your data with other parties for direct marketing. Does somebody know how to turn this on for political campaigns? Asking for a friend.
Right to data portability - you're allowed to move your data in a format that can be transferred to a new service provider.
Right not to be subject to automated decision-making - This one is big if you're planning to build AI products – users have the right to prevent organizations from using their data for any automated decision-making. When you build AI that performs some form of decision-making, you have to be aware of this right and ensure that users have the ability to opt out (a sketch of honoring that opt-out follows).
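As a rough illustration of that last right, here's a hypothetical sketch of gating an automated decision on a user's opt-out - the consent store and scoring function are stand-ins, not any real API.

```python
# Hypothetical sketch: honoring an opt-out from automated decision-making.
# The consent store and scoring function are stand-ins, not any real API.

def score_applicant(features):
    """Stand-in for a trained model."""
    return 0.7 if features.get("income", 0) > 50_000 else 0.3

def decide(user_id, features, consent_store):
    # If the user opted out, route the case to a human instead of the model.
    if consent_store.get(user_id, {}).get("opt_out_automated_decisions"):
        return {"route": "manual_review", "decision": None}
    score = score_applicant(features)
    return {"route": "automated", "decision": score >= 0.5}

consent_store = {"user-123": {"opt_out_automated_decisions": True}}
print(decide("user-123", {"income": 80_000}, consent_store))  # routed to a human
```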
The AI vs Privacy Tension
It's not surprising that the need for privacy and the need for AI to ingest data in order to perform its functions are at odds with each other.
When you're building an AI model, you often have no idea which features are important. However, privacy laws require you to disclose which data you're collecting and for what purposes - before you know which of it will actually matter.
AI models tend to want lots of data and as many features as possible - even when the correlation between a feature and the result is weak, more data tends to help model performance. Privacy laws, however, require that you collect only the minimum amount of data needed to do the work.
Oftentimes, if you collect enough data - even if you never explicitly collect sensitive data - you can infer patterns that suggest sensitive information. It brings up that famous story about Target predicting that a teenage girl was pregnant before her dad knew – they could tell from the types of items she was buying.
Why Should You Actually Care About This Stuff?
Aside from the obvious reasons to care about protecting data - like avoiding fines of up to 4% of your worldwide annual revenue, and just being a good, trustworthy organization to do business with - it also helps you maintain a reputation that attracts both users and employees.
Real Examples of When Things Go Wrong
In 2019, Google was fined $57 million by France's data protection regulator for violating GDPR. The regulator found that Google had used data collected from Android phone users for targeted advertising, yet had not clearly communicated to those users that their data would be used in this way.
Facebook was fined $5 billion in 2019, which was the largest ever fine for a privacy-related incident. In this case, it was the US Federal Trade Commission that brought the suit against Facebook, alleging that the company had deceived users about how much control they had over their own data within the Facebook platform.
How to Actually Protect User Privacy
You need to make sure you're compliant in both policy and practice. Design products with privacy in mind so you're considering the risks before they occur. And use technology - not just cybersecurity, but emerging approaches like differential privacy and federated learning - to help protect privacy even as you train models on the data.
Get Your Privacy Policy Right
Your privacy policy should comply with all applicable laws - which depends on where your users are located, down to the state level in some countries - as well as the industry-specific laws we talked about for healthcare, finance, and education.
User consent should be explicit – clearly spell out the choices users have, and give them a way to withdraw their consent and contact you if they have concerns about the data you're collecting.
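As a small illustration, a consent record could be as simple as the following sketch - the field names and purpose string are invented, and a real system might also log which version of the privacy notice the user saw.

```python
# Hypothetical sketch of an explicit-consent record with withdrawal.
# Field names and the purpose string are invented for illustration.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    user_id: str
    purpose: str                       # e.g. "personalized_ads"
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None

    def withdraw(self):
        """Record the moment consent was withdrawn."""
        self.withdrawn_at = datetime.now(timezone.utc)

    @property
    def active(self) -> bool:
        return self.withdrawn_at is None

consent = ConsentRecord("user-123", "personalized_ads", datetime.now(timezone.utc))
consent.withdraw()
print(consent.active)  # False: stop using this user's data for this purpose
```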
Privacy by Design: Prevention is Better Than Cure
The best cure is prevention, so build privacy by design into your product in order to mitigate risks before you accidentally create them.
There are key principles, like being proactive rather than reactive - anticipating these issues before rolling out the product.
Privacy should always be the default setting when users enter your product – you should not require people to change some additional setting to protect their privacy. Privacy should be embedded into the design of the product, not considered an add-on that you bolt on at the end.

We need to secure the data we collect with technical as well as administrative safeguards, and provide visibility and transparency to users about what data we're collecting and for what purposes.
And of course we need to be user centric in our design - don't be Target targeting pregnant teens.
The Cool New Tech: Federated Learning and Differential Privacy
Take advantage of new technology like federated learning - the model lives on your device (like your smartphone), gets updated or retrained using the data stored on the device, and only the model update is sent back to the central server - never the data stored on the device.
This lets users keep their data on their own device, under their control, rather than having to share it in order to train the model - while still receiving all the benefits of the trained model.
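Here's a minimal sketch of that idea using plain NumPy - the "devices", linear model, and learning rate are all invented for illustration, but the key property holds: only the weight updates ever leave a device, never its data.

```python
# A minimal sketch of the federated-learning loop, using plain NumPy.
# The "devices", linear model, and learning rate are invented for illustration.
import numpy as np

def local_update(global_weights, local_X, local_y, lr=0.1):
    """One gradient step on the device's own data (linear model, squared error).
    Only the weight delta leaves the device - never local_X or local_y."""
    preds = local_X @ global_weights
    grad = local_X.T @ (preds - local_y) / len(local_y)
    return -lr * grad                                    # the update, not the data

rng = np.random.default_rng(0)
global_weights = np.zeros(3)
devices = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(5)]

for _ in range(10):                                      # one federated round per loop
    updates = [local_update(global_weights, X, y) for X, y in devices]
    global_weights += np.mean(updates, axis=0)           # server only sees the updates
```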
Another emerging approach is differential privacy, which computes results (or trains models) in a way that makes it difficult to tell who was who in the data. It lets you use aggregated user data because it becomes effectively impossible to tell whether any particular user was in the data set used for training.
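A minimal sketch of the differential privacy idea, using the classic Laplace mechanism on a simple count - the query and the epsilon value are chosen purely for illustration:

```python
# A minimal sketch of differential privacy via the Laplace mechanism.
# The query (a simple count) and the epsilon value are chosen purely for illustration.
import numpy as np

def private_count(values, threshold, epsilon=1.0):
    """Count how many users exceed a threshold, plus calibrated noise.
    Adding or removing one user changes the true count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon hides any single person's contribution."""
    true_count = sum(v > threshold for v in values)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [52, 31, 47, 50, 63, 29]
print(private_count(ages, threshold=49))  # roughly 3, but you can't tell who's counted
```

Smaller epsilon means more noise and stronger privacy; the trade-off is a less accurate aggregate.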
How Apple Does Federated Learning
We've talked about how Apple can train their models and bring AI services to your phone without actually sending your data back to a central server – that's basically federated learning. The training is brought to the device instead of a central location, and it only occurs when the phone is eligible - charging, on Wi-Fi, and idle - so it's not impacting the user.
What the phone sends back to the server is the model update, not the data it trained on - if I've figured out which features matter for predicting some outcome and what the weights on those features are, that's what gets sent back, not your actual data. The server just wants the results of how to build the model.
To prevent the data being reconstructed from the results sent to the server, those results are encrypted right from the start – with a key that the server does not have. This is called secure aggregation: the server can combine the encrypted results but only decrypt the aggregate, the entirety. Imagine each of our smartphones sending a piece of the model into something called the aggregate, and the server only being able to decrypt the aggregate, never each individual piece.
Before anything is sent, the secure aggregation protocol scrambles the training results on each device.
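This is not Apple's actual protocol, but here's a toy sketch of the intuition: each pair of devices agrees on a random mask, one adds it and the other subtracts it, so individual uploads look like noise while the masks cancel exactly in the sum.

```python
# A toy sketch of the secure-aggregation intuition (not Apple's actual protocol):
# each pair of devices agrees on a random mask, one adds it and the other subtracts it,
# so individual uploads look like noise but the masks cancel exactly in the sum.
import numpy as np

rng = np.random.default_rng(42)
true_updates = [rng.normal(size=3) for _ in range(3)]    # one model update per device

# Pairwise masks: device i adds mask (i, j); device j subtracts the same mask.
masks = {(i, j): rng.normal(size=3) for i in range(3) for j in range(i + 1, 3)}

masked = []
for i, update in enumerate(true_updates):
    m = update.copy()
    for (a, b), mask in masks.items():
        if a == i:
            m += mask
        elif b == i:
            m -= mask
    masked.append(m)                                     # what the server receives

# Each masked update alone is meaningless; only the aggregate matches the truth.
print(np.allclose(sum(masked), sum(true_updates)))       # True
```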
What if one phone has really unique data - could that data be compromised and show up inside the model? This is where differential privacy is used – it's a way to deal with model memorization, the risk that a model is overly influenced by a single contributor or some unique piece of data. You don't want your model tuned to some weird one-off piece of information; you want it to find common patterns in the data rather than memorize something specific to one source. You can limit how much any one phone contributes to the model, and add noise to obscure the rare data (a small sketch of that follows).
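A minimal sketch of that idea: clip each device's update so no single phone can dominate, then add noise to the aggregate. The clip bound and noise scale here are invented; real deployments tune them carefully.

```python
# A minimal sketch of limiting one device's influence: clip each update's norm,
# then add noise to the aggregate. The clip bound and noise scale are invented here;
# real deployments tune them carefully.
import numpy as np

def clip_update(update, max_norm=1.0):
    """Scale the update down so no device can push harder than max_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, max_norm / max(norm, 1e-12))

rng = np.random.default_rng(7)
updates = [rng.normal(size=3) for _ in range(100)]
updates.append(np.array([50.0, -50.0, 50.0]))            # one phone with very unusual data

clipped = [clip_update(u) for u in updates]
aggregate = np.mean(clipped, axis=0) + rng.normal(scale=0.01, size=3)  # noisy aggregate
print(aggregate)   # the outlier phone barely moves the result
```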
Because all the data is still sitting on your users' phones, that's also where the model gets tested - some phones train the model while other phones test it. Because it's learning from thousands of users, the model is pretty smart – but it's still static. It won't keep updating as you use it; it only learns with the next update of the model.
You can learn from everyone without learning about any one.