Privacy Preserving Technologies

Speakers:

Paul Starrett, Patricia Thaine

Transcript:

Paul Starrett
Hello, and welcome to our podcast today, sponsored by PrivacyLabs. My name is Paul Starrett, and I’m the founder of PrivacyLabs. Remember, PrivacyLabs is one word. We have the pleasure of having Patricia Thaine here today. I’m very honored to have her; her background is essentially unrivaled in her area. What we’re going to do today is try to bring some perspective to what privacy preservation means. That’s a fairly broad topic technically, and its application to the types of data, and the ways in which it can bring a solution to your compliance needs, is quite broad. So we’re going to try to wrangle that first and then get down into some specifics. With that said, Patricia, if you could tell us about yourself and about your company, and then we’ll get into the first question.

Patricia Thaine
That was great. Paul, thank you so much for having me; it’s an honor to be here. I am a PhD candidate at the University of Toronto. My research is on privacy-preserving natural language and spoken language processing. And I am the co-founder and CEO of a company called Private AI, where we make it super easy for developers to integrate privacy into their software pipelines.

Paul Starrett
Got it. Great. Okay, great. And just so people know, your company is at private-ai.ca; that’s private, hyphen, AI, dot C-A, because you’re in Canada, Toronto-based, if I’m not mistaken. So I think the first question, really, is this: there are many different technical solutions to preserving privacy, and I wanted you to give us a sense of the different buckets, the different areas, and how they are distinguishable.

Patricia Thaine
Absolutely. And I actually have a privacy-preserving technologies decision tree that you can take a look at on the Private AI website. It can get quite confusing, because you have to understand the technology, you have to understand the use cases, and you have to understand where they fit in, depending on which regulations you want to comply with, for example. So you’ll often hear about homomorphic encryption, secure multi-party computation, anonymization, and synthetic data generation. And in a lot of cases, you might hear these technologies being pitted against one another. But really, in the best of all worlds, these are all complementary. These aren’t one or the other: you can’t solve everything with homomorphic encryption, you can’t solve everything with anonymization, you can’t solve everything with synthetic data generation. If you’re looking at homomorphic encryption, for example, some good use cases are when you want to compute something in the cloud on personal data that you’re ingesting from users, but it’s a fairly repeatable computation. Or you could search through a database to find the result of a query, for example, and then send it back to the devices on the edge. Homomorphic encryption allows you to add and multiply numbers that are encrypted, while they’re still in their encrypted form, and then the decryption of the output can happen just on the user side. So that is a really cool technology. It has limited use cases, but those limited use cases make it so that if you do want to use this technology, it tends to be for things like credit card numbers, for really, really sensitive information, where you’re okay taking an extra amount of processing time in order to do this, because the downside of having that information leaked is so huge that you need to prevent it in any way possible.
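To make the homomorphic property concrete: in the Paillier cryptosystem, multiplying two ciphertexts yields an encryption of the sum of the plaintexts. Below is a hand-rolled toy sketch in Python with tiny hard-coded primes, purely to illustrate the arithmetic described above; a real deployment would use large keys and a vetted library, never code like this.

```python
import math
import random

# Toy Paillier cryptosystem: additively homomorphic encryption.
# Tiny hard-coded primes for illustration only; real deployments use
# 2048-bit keys and a vetted library, never hand-rolled crypto.
p, q = 2003, 2011
n = p * q
n2 = n * n
g = n + 1                      # standard choice of generator
lam = math.lcm(p - 1, q - 1)   # Carmichael function of n
mu = pow(lam, -1, n)           # precomputed decryption constant

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:          # r must be a unit mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return (((x - 1) // n) * mu) % n

# The homomorphic property: multiplying ciphertexts adds plaintexts,
# so the sum is computed without ever decrypting the inputs.
a, b = encrypt(17), encrypt(25)
total = decrypt((a * b) % n2)   # 42
```

This is the "add numbers while still encrypted" operation in miniature: the party holding only ciphertexts can compute `(a * b) % n2`, and only the key holder on the user side can decrypt the result.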
Now, differential privacy: that is used in various cases. If you’re talking about structured data, or querying a structured database, differential privacy can be used at the output of the query to the database. You add a little bit of noise, so if you make a query of how many people in this database have, you know, green hair and smoke, you’ll get an approximate result. That means you don’t know if there was one person, no people, or ten people, but what you will get is a result that allows you to make generalizations about a population. So the noise that’s added shouldn’t be so much that you can’t make generalizations like "smoking causes cancer," because you’ll see that’s something a lot of the population has in common when they smoke. But you wouldn’t be able to figure out that somebody specific has a rare disease and is part of that data set. Differential privacy is also used for training machine learning models. There’s this great paper by Carlini et al. called "The Secret Sharer," where they hide a social security number within the Penn Treebank data set, and they show how language models, specifically character language models in this case, memorize the digits of that hidden social security number, even though it only appears once. That, as well as other examples that have been brought up of language models spewing out personal data, showed the need for either, one, using differential privacy when training the models, which adds a little bit of noise to the training so the model doesn’t memorize specific information but rather overarching patterns seen multiple times; or, two, what we’ve observed at Private AI: you can also redact the sensitive information or pseudonymise it, and if it’s not there in the first place, the model can’t memorize it.
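A minimal sketch of the noisy counting query described above, in Python: a count has sensitivity 1 (adding or removing one person changes it by at most 1), so adding Laplace noise with scale 1/epsilon gives epsilon-differential privacy for a single query. The record fields and helper names here are illustrative, not from any particular library.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of Laplace(0, scale) using only the stdlib.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(max(1e-300, 1.0 - 2.0 * abs(u)))

def private_count(records, predicate, epsilon=1.0):
    # A counting query has sensitivity 1, so Laplace noise with scale
    # 1/epsilon yields epsilon-differential privacy for this one query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

people = [
    {"smokes": True,  "hair": "green"},
    {"smokes": False, "hair": "brown"},
    {"smokes": True,  "hair": "brown"},
]
noisy = private_count(people, lambda r: r["smokes"])  # roughly 2, plus noise
```

Averaged over many hypothetical runs the answer centers on the true count, which is exactly the trade-off described: population-level patterns survive, individual membership does not. Note that repeated queries consume privacy budget, which this single-query sketch does not track.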

Paul Starrett
I see. Interesting. Um, does that cover most of the various areas, or are there more? There’s secure multi-party computation, and I can sense a reluctance on my part to go there, but we definitely can; there are quite a few technologies, and you’re happy to cover them all. I think I’ve heard of that one, and then there’s federated learning. Maybe we could touch on those, but I think it’s a good idea to keep this focused. That’s the problem: I’m technical, and I get a lot of what you’re saying, but covering everything would be a bit much. So maybe secure multi-party computation, and then federated learning.

Patricia Thaine
Yeah, so secure multi-party computation. The idea there is that you’ll have two or more parties (if it’s two parties, it’s secure two-party computation), and these parties don’t want to show each other their inputs, but they want to share an output. One example of secure multi-party computation in use is genomic data analysis: when hospitals want to analyze data jointly without revealing it to one another, they can use secure two-party or multi-party computation to get that result. The way it often works is using what are called garbled circuits. These are circuits where you can’t tell what the inputs were, but you can understand the output, and you have to frame the inputs in such a way that when you put them into the circuit, they combine nicely to make a sensible output. This is used in combination with federated learning for machine learning. Federated learning is about training your machine learning models directly on the devices where the information is being produced. For example, Google uses federated learning for Gboard, and Apple uses it for emoji prediction. What you want to avoid when you’re using federated learning is for the resulting weights of the model, the updates that you’re making to the model, to reveal anything about what kind of information they were trained on. And the way to do that is by combining the models from multiple users using secure multi-party computation. The secure multi-party computation protocols available for this, in some cases, have a minimum number of users that need to put in their models for the output to make any sense. That prevents someone from saying, "let’s just take these two users’ models and get the output," and then trying to figure anything out about what the updates were.
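One way to picture the secure aggregation described above for federated learning is additive secret sharing: each client splits its model update into random shares that sum to the update, so the combined result reveals only the sum of all updates, never any single one. The sketch below is a toy of the core arithmetic only; production protocols (such as Bonawitz et al.’s secure aggregation) add pairwise masking and dropout handling, and the function names here are made up for illustration.

```python
import random

# Toy secure aggregation via additive secret sharing: the aggregator
# learns only the SUM of the clients' updates, never an individual update.
PRIME = 2**31 - 1  # all arithmetic is modulo a public prime

def share(value: int, n_parties: int):
    # Split `value` into n random shares that sum to it mod PRIME.
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def aggregate(updates):
    n = len(updates)
    # Each client shares its update; party i collects the i-th share
    # from every client, so no party ever sees a complete update.
    all_shares = [share(u, n) for u in updates]
    partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]
    # Combining the partial sums reveals only the total.
    return sum(partial_sums) % PRIME

total = aggregate([5, 11, 26])  # the aggregator learns 42, not 5, 11, or 26
```

Each individual share is uniformly random, which is why seeing fewer than all of them reveals nothing about a client’s update; that is the same intuition behind the minimum-user thresholds mentioned above.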

Paul Starrett
I see, I see. Wow, that’s a lot of different technologies. But it sounds like, as you said, each one is useful in different contexts. So, for example, differential privacy is for machine learning and querying, maybe alongside synthetic data and such; federated learning and secure multi-party computation are for when you have multiple inputs from different parties; and then, if I remember what we discussed, because there’s so much here, the idea with homomorphic encryption is that you’re querying a database on an ongoing, repetitive basis.

Patricia Thaine
Yeah, that’s one example. So if you have a database of social security numbers, for example, and you want to see whether there’s a match with one social security number that’s input. But you can also do some light machine learning using homomorphic encryption; that’s still a very early area of research, though.

Paul Starrett
I see. I see. So it really depends on the context. Yes, and you mix and match these things in order to bring yourself to a place that is optimal for the balance. That brings me to my next question, and that is what I’ve heard referred to as a privacy budget. There is this sort of balance between the usefulness of the underlying original data, where the private, sensitive information is, and the obfuscation, to coin my own phrase, of that data, so that it is compliant with whatever law applies and whatever infrastructure or workflows you have going on. I want to say it’s subjective, but it is very difficult to pin down a metric that says: how much do I take out of the data to make it privacy-preserving, compared against how much of the insight that data provides is lost in the process? What are your thoughts there? I think one of the things the listeners would be really keen to know is: is it fair to say that there’s always a possible net gain with the use of these technologies? Can we say, for any workflow, infrastructure, or technical topology, that this is always worth doing, that we will always find the right place to put the needle? How would you approach that thought, if it makes sense to you?

Patricia Thaine
Totally makes sense. It really, once again, depends on the use case. Where to place the needle is still a research topic, and will continue to be for many years on end, on a case-by-case basis. One new thing I’ve seen come up, for example, is this work by Khaled El Emam, who does synthetic data generation through Replica Analytics. They have worked on the question: once you figure out where to place the needle, how can you know what the re-identification risk is of the data that you’re creating? They’ve combined re-identification risk metrics from anonymization methods with synthetic data generation to have a clear idea of whether the data produced is actually privacy-preserving. So that gives you an even better idea of whether you’re placing the needle in the right spot. And then in other cases, you could just make the data privacy-preserving and you don’t lose anything; it really depends on the use case. This is particularly visible for unstructured data: images, video, text. If you want to tell how well a conversation went, you don’t need any information about the participants; you just need, you know, the positive and negative words that are present in the conversation. So you can remove names, you can remove social security numbers, ages; none of that is giving you any extra information for your task. And for images, I mean, there are a lot of tasks that don’t even require people to be in there, so those are already anonymous by default, things like temperature analysis or relief analysis. And if you just remove the people from, for example, videos of streets, and you want to compute just how many people are on the street, rather than extract any information about an individual, you’re already getting a certain amount of information that’s useful. Like, how busy is it? Is there anything funny going on?
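A toy illustration of the kind of re-identification risk measurement discussed above: group records by their quasi-identifier combination and check how small the smallest group is (the "k" of k-anonymity) and what fraction of records are unique. The field names and records are made up for illustration, and real risk models such as El Emam’s are far more sophisticated than this sketch.

```python
from collections import Counter

def risk_report(records, quasi_identifiers):
    # Count how many records share each quasi-identifier combination.
    # A record whose combination is unique (class size 1) is the
    # easiest to re-identify by linking against outside data.
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    sizes = [classes[tuple(r[q] for q in quasi_identifiers)] for r in records]
    return {
        "min_class_size": min(sizes),  # "k" in k-anonymity terms
        "pct_unique": sum(s == 1 for s in sizes) / len(sizes),
    }

# Hypothetical generalized (binned) records, as a synthetic-data
# generator might emit them.
synthetic = [
    {"age": "30-39", "zip": "941**", "smokes": True},
    {"age": "30-39", "zip": "941**", "smokes": False},
    {"age": "70-79", "zip": "100**", "smokes": True},  # unique combination
]
report = risk_report(synthetic, ["age", "zip"])
```

Running a check like this on the generated data, rather than only on the generation process, is one way to verify that the needle landed in the right spot.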
And you don’t need any personal data there to figure that out. So I think that a lot of the dialogue that’s been happening about utility versus privacy has been about structured data, where you do have set columns with the very specific information that you’re trying to get at, for, like, pharmaceutical trials, and very little has focused on what that means for unstructured data.

Paul Starrett

Yes, and actually, that’s a fantastic segue for us. Before we move there, just for a second: would you agree that there’s an aspect of the needle, again, of privacy preservation versus the utility of the underlying data, that might be a risk question for the lawyers or the compliance folks? They might say, well, here’s our budget, here’s our risk, here’s our commercial purpose. There are some end goals; machine learning, or what have you, has some commercial value. So there’s some question for the compliance steering wheel.

Patricia Thaine

Yeah. I think what we need to do as a research community is make that needle more clear. What exactly does it mean when you have a certain amount of privacy budget? I do see some work that’s headed in that direction, but really, it’s about understanding the data that’s produced afterwards, and that’s something that does not have very much research behind it yet.

Paul Starrett

I see. Interesting. So that’s an area of challenge, then. Got it. That’s very helpful. Great. So I guess we can go then into the data, which really gets right down to the heart of what your firm does: the type of data really drives a lot of this. You have your tabular data; if our audience doesn’t know, that’s like a spreadsheet, columns and rows. It tends to be simple numbers, or simple values like a name, what would be called categorical; it may have text in it too, but it’s generally not as much of a challenge. But when you get into emails, research documents, and other natural free-form text, and you get into audio and some of these other areas, that’s really a whole different bucket of worms, if you will. So this is basically a place for you to discuss what your company does, because I would recommend, as you alluded to, the decision tree flowchart you have at private-ai.ca, just to see how these things are handled. Maybe cover the different types of data and the challenges you’re finding for each of those, especially text, because that’s an area where I have some background in NLP.

Patricia Thaine
Sounds great. Actually, we do focus more on text than we do on images and video, so I can definitely focus on what we’re doing with regard to text. For text, what we’re doing is redacting direct identifiers and quasi-identifiers, and also doing automatic pseudonymisation. So, redacting things like names, ages, approximate locations, exact locations, credit card numbers; we have a huge list of information that we can redact. In addition to that, we also have the ability to pseudonymise based on information from the context. It’s not a lookup table where you take names and replace them, which becomes really predictable and also doesn’t work well with the context. Pseudonymisation itself, when done properly, context-based, one, decreases the re-identification risk of an individual, because if anything was missed by the de-identification system, it’s very difficult to tell the original data from the fake data, because of how natural it looks. And, number two, it also prevents lowering machine learning model accuracy for downstream models. So if you’re doing sentiment analysis, named entity recognition, or other tasks with data that’s de-identified, and it’s actually de-identified through pseudonymisation, then your models are better suited for the task, because you’re using more natural data.
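For a feel of the shape of the task, here is a toy regex-based redaction pass in Python. To be clear, this is not Private AI’s approach (their system is context-aware machine learning, and patterns like these miss names, addresses, and anything without a fixed format); it only illustrates finding direct identifiers in free-form text and replacing them with labeled placeholders.

```python
import re

# Toy redaction pass over free-form text: replace identifiers that have
# a fixed format with labeled placeholders. A pattern-only approach
# misses anything without a rigid shape, which is why production
# systems use context-aware models instead.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

msg = "Reach me at jane@example.com, SSN 123-45-6789."
clean = redact(msg)  # "Reach me at [EMAIL], SSN [SSN]."
```

Context-based pseudonymisation goes one step further than these bracketed placeholders: it swaps in realistic fake values, so anything the system missed no longer stands out as the only real identifier in the text.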

Paul Starrett
I see. I guess, just so the audience knows, pseudonymisation is a term that basically means you’re pseudo-anonymizing: it’s not complete anonymization, but it’s some steps in the middle. But free-form text, natural language processing, the inclusion of search engines, and things like that; free-form text is so diverse that I think it’s difficult to glean the underlying information.

Patricia Thaine
Yeah, so it really depends on the kind of text that you’re dealing with. And to your point about pseudonymisation: there is this misconception that it is less secure than redaction, that is, removing the information outright rather than replacing it with fake names, locations, and so on. Part of that comes from the GDPR saying that pseudonymisation is less secure than anonymization. But if you look more deeply into that, the GDPR actually defines pseudonymisation as having a link between the information that’s being used for replacement and the original data, and that link is what counts as making it less secure. So if you remove that link, it becomes much harder to re-identify an individual.

Paul Starrett
I see. I see. Got it. Thank you for clarifying that, because I think that’s part of what you were talking about earlier when you said: how do we determine what the needle is actually doing before a risk professional can say, yes, I bless this, it’s compliant. Got it. I think the last few questions should focus, then, on where you see the law going. I mean, this is almost rhetorical, but it seems to me there’s a real move toward being more strict, and you’re seeing more of this spread across different jurisdictions. So we’re getting an onslaught, and you’re probably a little more familiar with this than I am. In the United States, the individual states are now starting to come out with laws; CCPA is now CPRA, which is even more ominous, or onerous, I should say. What’s your sense of where the law is going? I would imagine that would bode well for firms like yours. Where do you see it going?

Patricia Thaine
Well, we are seeing more and more laws that take up the GDPR standard and then customize it for their location. And we are seeing more companies that just take the GDPR as the standard regulation they comply with and then adapt their practices from place to place. Privacy by design and privacy by default are both part of the GDPR, part of the LGPD in Brazil, and part of a bunch of regulations that are popping up. And I think that as lawyers figure out what these different technologies can do, and where the appropriate scenarios are for them, we might get more and more refined laws. Or it might move more toward experts having to validate whether or not you’re doing the right thing. So it’s really up in the air where it’s going to go. But what we are seeing is that grace periods are finishing for a lot of these laws; the GDPR is going to be enforced more severely in the coming years, and they’re setting up regulatory bodies that will keep closer track of which companies are doing the right thing. So it will be really interesting to see, ultimately, how this is going to affect government budgets for privacy regulation, how it’s going to affect privacy research, and how much more privacy research budget there’s going to be. I’m really curious about how it’s going to go.

Paul Starrett
Yes, and I would agree, because I think part of this is educating, as I’ve alluded to, the lawyers. A lot of lawyers, and I am one myself, though my license is inactive right now (I don’t use it, but I can go back any time I want), chose liberal studies as a direction because they’re self-professed technophobes. So getting them up to speed in a way that lets them make an informed decision, I think, is going to be one of the challenges, because the technologies we’ve discussed are, even at a superficial level, still somewhat difficult to grasp. It’s no one’s fault; it’s just the nature of the beast. So I guess the last question I might put to you is this. There’s a certain cost to do research and development, and there’s a certain cost to build the infrastructure for testing and for production (for the audience, production meaning what actually goes into the enterprise infrastructure to help real-world systems maintain privacy preservation). There’s a cost to all that, and that’s going to somehow guide how well an enterprise can operate in whatever it wants to do commercially. So there’s a certain barrier to those purposes. Do you think these technologies will remain cost-effective, in general, for the purposes they’re meant for?

Patricia Thaine
Yeah, that’s a good question, Paul. There’s normally a barrier in cost, but also a barrier in education. There really aren’t that many privacy courses in technical colleges or in computer science departments at the moment; that is growing in quantity, but there’s still quite a bit of an education barrier when it comes to figuring out which privacy technologies to integrate and how to integrate them properly. So there is that huge bit of onerousness that companies need to deal with, where it’s very hard to find the talent and very hard to figure out when to integrate different privacy technologies. What I’ve noticed a lot is that data protection officers in some companies get empowered to go talk to the technical teams and get into the nitty-gritty. But in other cases, they’re the people the technical teams go to for a yes or no on a particular problem, but then don’t work with in order to find a solution that gets to a yes. So there’s that barrier in education, costs, and internal politics, all of which need to be solved in order to get to the next level. What we’re trying to do at Private AI is make it a bit more cost-effective for companies to do this, and also more secure than having engineers with no privacy training working on privacy problems. And what I see happening is that there is more open-source software going out there that software engineers can take advantage of, and there are more privacy companies around to help guide you. But there’s still a really big gap when it comes to open-source software for easily integrating privacy into your software pipeline, and that’s really where the market needs to move.

Paul Starrett
Interesting. Interesting. Great; I think that’s a good answer, because I think that’s really where people are going: what is privacy preservation, and where is the solution? Frankly, I think your company is very much in that game. I think it’s a hockey stick, if you will, a kind of floodgate that’s going to open. I always give every guest on the podcast the opportunity to say anything they want to the audience that we haven’t discussed, anything you think might be important for them to know, or to emphasize. Take your time; we’d love to let you come up with anything worthwhile.

Patricia Thaine
I’d like to point out that the work Twilio is doing for privacy is really spectacular. I’m a huge fan of their data protection officer, Sheila Jambekar. We did a webinar with her a couple of weeks ago at Private AI that’s on YouTube; you can take a look at that to see how they’re dealing with privacy internally within the organization. I think they’re quite a leap ahead of most companies in how their privacy team interacts with their engineers, and it’s definitely worth taking a look at.

Paul Starrett
Interesting. That’s Twilio, right?

Patricia Thaine
That’s right, and her name is Sheila Jambekar.

Paul Starrett
Do you mind spelling that real quick?

Patricia Thaine
Yes. I just want to make sure we’re getting it right: Sheila Jambekar, and Twilio.

Paul Starrett
Got it. That’s really good, because that gives people a good bellwether, if you will. I’m an organizer of the SF Python meetup group here in San Francisco, and one day we had Twilio come in and we did some Python programming in their environment. I was impressed; a very good crew, very into education. I noticed they’re very big on that. Listen, I’m going to close out here with a little bit of a discussion of PrivacyLabs, and then we can let you go and finish up. What we do is help unify all these different moving parts: the synthetic data, the homomorphic encryption, the cybersecurity, the cloud, on-premise, all of those things. One of the things we do is perform audits of artificial intelligence and workflow automation, along with the holistic aspect of what’s happening. I could definitely see from what you said today how, if you pull on one thing, it tugs at many different things, and you really have to consider those things. So I think I will end on getting in touch with you: would it make sense for people to go to your website and contact you that way, or do you have an email address you’d like to share?

Patricia Thaine
Feel free to email me. My email address is patricia@private-ai.ca.

Paul Starrett
Got it. And my mother’s name is Patricia, so that’s great. Yeah. Okay, and I’m Paul Starrett of PrivacyLabs; you’re already on our website if you’re listening to this. So thank you again, Patricia. I really appreciate it; this was very useful and helpful, and hopefully we’ll have you on another one soon.

Patricia Thaine
Would love that. Thank you so much, Paul. Really, Thank you.
