A minor but important correction. Krebs wrote that the Gov claimed that “fixing the flaw could cost the state $50 million.” That’s not quite right. In the press conference linked in Krebs's post, the Governor actually claims that the “incident alone may cost Missouri taxpayers up to $50 million.” I’d guess this number includes an estimate for the legal cost of dealing with the data breach plus any statutory penalties the state might incur (plus a grossly inflated price for fixing the bug).
I wish the authors wouldn't oversell the privacy claim:
> Github: "The DeepPrivacy GAN never sees any privacy sensitive information, ensuring a fully anonymized image."
> Abstract: "We ensure total anonymization of all faces in an image by generating images exclusively on privacy-safe information."
> Paper: "We propose a novel generator architecture to anonymize faces, which ensures 100% removal of privacy-sensitive information in the original face."
Changing a face anonymizes an image the same way that removing a name anonymizes a dataset -- poorly. This is cool, but it's not anonymization.
"This is cool, but it's not anonymization." Isn't it?
For clarity it might be good to establish what I mean when I talk about three terms: "identifiable" is either the original, encrypted with the key available, or a hashed version or bloom filter (or so) of low-entropy data such as email addresses or phone numbers; "pseudonymous" is replacing the data with a unique but disconnected value (e.g. a UUID, or encrypted with a random key and key destroyed); and "anonymous" is either no data, or data that has no relation to the original.
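To illustrate why a hash of low-entropy data still counts as "identifiable" under these definitions, here is a minimal sketch. The 7-digit number, the fixed "555" prefix, and the helper names are all made up for illustration; real phone numbers have a larger but still easily enumerable space.

```python
import hashlib

def sha256_hex(value: str) -> str:
    """Unsalted SHA-256, standing in for whatever hash a dataset might use."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def recover_phone(target_hash: str, prefix: str = "555", digits: int = 4):
    """Enumerate every number with the given (hypothetical) prefix and compare hashes."""
    for n in range(10 ** digits):
        candidate = f"{prefix}{n:0{digits}d}"
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None

# What a "hashed" dataset would publish for one made-up phone number.
published = sha256_hex("5550142")

# The attacker only needs the hash and a guess at the input format.
recovered = recover_phone(published)
```

The search space here is only 10,000 candidates, so the hash hides nothing; the same loop scales to full-length phone numbers on ordinary hardware.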
As far as I can tell, this algorithm replaces the data with a random value that has no relation to the original. I understand that if we have a list of HN comment metadata and you remove the usernames ("anonymize"), you can still find me by the time of posting correlated to DNS request logs at the ISP. In the case of pictures, I guess the place is usually identifiable + the time is known, thus you can potentially piece together who was there at that time, corroborated by the presence of a certain backpack or shirt.
Is that what you mean, or is there something else that makes you say it is either still identifiable or pseudonymized rather than anonymized?
> ... this algorithm replaces the data with a random value that has no relation to the original.
Based on that sentence, I assume that when you write "the data" you mean "the part of a picture corresponding to a person's face."
But removing the face from a picture doesn't necessarily make it particularly difficult to identify the subject if the subject is very familiar to you. It doesn't matter if you've never seen that specific picture, or if you have no additional context like place and time.
Just look at the examples on the GitHub page for proof! The picture of Obama and Trump is clearly recognizable, and at least one of the other Obama photos is easy to recognize. The soccer players are identifiable from their jerseys (Messi is #10 on Barcelona). Jennifer Lawrence was also easy for me to spot.
Fair enough, if you know what someone wears, their exact skin color, build, and perhaps even the place they are in, then sure, blacking out the face (or changing it for that matter) won't help. I guess I agree that this is more common than the authors make it sound (it's indeed not 100% guaranteed absolutely anonymous always ever, as they put it). But I do have to say, this is about as good as blacking out the face completely and a lot less obnoxious.
> The picture of Obama and Trump is clearly recognizable
You sure? If I show this picture in isolation to someone https://snipboard.io/VjwEc1.jpg I'm not sure that they will say it's Obama. Not sure there is a politically correct way of saying this, but there aren't that many people that are well-known by billions with that skin tone and in a suit, so of course if you ask them "the face was changed, who is this?" they can make a lucky guess for Obama because that's the only guessable possibility.
It's true as long as the function f is sufficiently "shrinking." The domain of the function (where the x's live) must be sufficiently larger than its range (where the f(x)'s live). For example, if the domain is size N, a range of size 0.99N is enough to guarantee that collision resistance implies preimage or second-preimage resistance.
Said another way, if there are many collisions and you still have a hard time finding them (collision resistance), then you can prove that it's also hard to find preimages or second preimages.
Your example, f(x) = x is not shrinking at all: there are no collisions.
A fundamental property of hash functions is that they're shrinking---so much so that it often goes without mention in informal settings. Hash functions are typically defined in two ways: shrinking arbitrary length inputs to a constant length (e.g., n bits to 256 bits) or shrinking arbitrary length inputs by some constant amount (e.g., n bits to n-1 bits, or n/2 bits). Even shrinking by one bit makes the range half the size of the domain, guaranteeing many collisions and ruling out counter-examples like the one you gave.
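The counting argument can be checked on a toy scale. The sketch below is only an illustration: `toy_hash` is a made-up function (SHA-256 truncated to 11 bits) applied to a 12-bit domain, so collisions are trivially found by brute force. It shows that collisions must exist once the function shrinks by one bit, not that they are hard to find.

```python
import hashlib
from collections import defaultdict

N_BITS = 12  # toy domain: all 12-bit inputs; range: 11-bit outputs

def toy_hash(x: int) -> int:
    """Hypothetical shrinking function: N_BITS in, N_BITS - 1 out."""
    digest = hashlib.sha256(x.to_bytes(2, "big")).digest()
    # Keep the top N_BITS - 1 bits of the first 16 bits of the digest.
    return int.from_bytes(digest[:2], "big") >> (16 - (N_BITS - 1))

domain_size = 2 ** N_BITS
range_size = 2 ** (N_BITS - 1)

buckets = defaultdict(list)
for x in range(domain_size):
    buckets[toy_hash(x)].append(x)

distinct_outputs = len(buckets)
# Pigeonhole: at least domain_size - range_size inputs share an output
# with some other input.
surplus = domain_size - distinct_outputs

# The identity function f(x) = x does not shrink, so it has no collisions.
identity_outputs = len({x for x in range(domain_size)})
```

Here `surplus` is at least half the domain, while the identity function produces one distinct output per input.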
Rogaway and Shrimpton specifically used collision resistance not implying preimage resistance as an example of the importance of definition and assumptions:
> Informal treatments of cryptographic hash functions can lead to a lot of ambiguity, with informal notions that might be formalized in very different ways and claims that might correspondingly be true or false. Consider, for example, the following quotes, taken from our favorite reference on cryptography [..] "collision resistance does not guarantee preimage resistance" - [0]
They go on to show the definitions under which collision resistance does and does not imply preimage resistance.
The paper you cite is a good one, but it's actually demonstrating that the person you're responding to is correct (and you two are agreeing). In fact, Rogaway and Shrimpton specifically state that their constructions may appear somewhat contrived; this is because collision resistance does imply provisional preimage resistance, and in the real world it's quite difficult to construct (useful) collision resistance without preimage resistance.
So to answer your original question succinctly: collision resistance implies provisional preimage resistance, which is the setting for most real world hash functions, including post-quantum hash-based signatures.
You're certainly right that formal definitions are important. However, on this forum, I think informality can be appropriate. Though there are variations and inconsistencies, in the theoretical cryptography community, second preimage resistance is most often formalized as "universal one-wayness" and preimage resistance is formalized as "one-wayness."
I was, however, careless when I claimed that shrinking by 1 bit suffices for preimage resistance. The hash function needs to shrink by at least log(n) bits to rule out computationally-bounded adversaries finding preimages.
Also, apologies for the formatting of my OP - I don't post here often.
K-anonymity provides very little protection, if any. A few brief points:
1. I've never seen a formal definition of security that k-anon supposedly satisfies. While I personally really like formal guarantees, maybe one might argue this wouldn't be so bad absent concrete problems with the definition. Which leads us to...
2. K-anon doesn't compose. The JOIN of 2 databases, each k anonymized, can be 1-anonymous (i.e., no anonymity), no matter what k is.
3. The distinction between quasi-identifiers and sensitive attributes (central to the whole framework) is more than meaningless: it is misleading. Every sensitive attribute is a quasi-identifier given the right auxiliary datasets. Using k-anon essentially requires one to determine a priori which additional datasets will be used when attacking the k-anonymized dataset.
4. My understanding of modified versions (diversity, closeness, etc) is less developed, but I believe they suffer similar weaknesses. The weaknesses are obscured by the additional definitional complexity.
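One way to see the composition problem from point 2 is a small constructed example. Everything below is hypothetical: the population, the column names, and the assumption that each person has a distinct diagnosis that appears in both releases and so can serve as a join key. It is a sketch of one scenario, not a claim about how real releases are usually attacked.

```python
from collections import Counter

K = 3  # any K >= 2 works for this construction

# Hypothetical population: K*K people, each with a distinct diagnosis.
records = [
    (f"age-band-{i}", f"region-{j}", f"diagnosis-{i}-{j}")
    for i in range(K)
    for j in range(K)
]

# Two separate releases, each exposing a single quasi-identifier.
release_a = [(age, dx) for age, _, dx in records]
release_b = [(region, dx) for _, region, dx in records]

def k_of(rows, quasi_identifier):
    """Smallest group size when rows are grouped by the quasi-identifier."""
    counts = Counter(quasi_identifier(row) for row in rows)
    return min(counts.values())

k_a = k_of(release_a, lambda row: row[0])  # grouped by age band
k_b = k_of(release_b, lambda row: row[0])  # grouped by region

# JOIN the two releases on the shared diagnosis column.
dx_to_region = {dx: region for region, dx in release_b}
joined = [(age, dx_to_region[dx], dx) for age, dx in release_a]
k_joined = k_of(joined, lambda row: (row[0], row[1]))
```

Each release on its own is K-anonymous, but after the join every (age band, region) pair is unique, so `k_joined` is 1 regardless of K. The sensitive column doing the work of a join key is also an instance of point 3.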
1. As I said most people don't use plain k-anonymity as it can leak information about the sensitive attribute when the values of this attribute in a group are (almost) all the same. This is why extensions like l-diversity and t-closeness exist: l-diversity ensures that in each group there will be at least l different values of the sensitive attribute, t-closeness ensures that the resulting distribution of the sensitive attribute values in a group is close (as e.g. measured by the "earth mover's distance") to the distribution of the sensitive attribute in the entire dataset. Given the original data and the anonymized data sets it's pretty easy to measure the information gain (e.g. using a Bayesian approach) of an attacker if he/she knows in which group a given person is. In that sense k-anonymity (with l-diversity/t-closeness) can be analyzed in a formal context just like e.g. differential privacy.
2. Yes that's what I mentioned at the end, k-anonymity is not different from most other techniques here: If you use differential privacy with the Laplace mechanism and repeatedly publish independently anonymized versions of the same underlying data you will leak information (as an attacker will be able to average the released values in order to get an estimate of the true value).
3. Yes sensitive attributes are often quasi-identifiers as well (at least in combination with other quasi-identifiers), they are treated differently because the underlying risk model does not regard a (non-sensitive) quasi-identifier as something that needs to be protected. Inferring e.g. your gender from your zip code, age and body weight using an anonymized data set is (usually) not considered problematic, whereas learning that you are HIV-positive would (almost always) be problematic, hence the distinction. Also, sensitive attributes are treated as a group when applying k-anonymity, i.e. if we have two binary attributes (HIV, Syphilis) one applies the anonymization criteria to the combinations of the attributes ((true,true), (false, true), (true, false), (false, false)), not individually to each attribute (as this can cause information leakage).
4. I honestly don't know what to reply to this, as l-diversity/t-closeness are well specified methods that were designed to overcome the (known) limitations of k-anonymity. Yes, these methods are not completely trivial to use, but if used correctly they can provide good and quantifiable protection. Not using them since they are hard to implement correctly is like saying we shouldn't use cryptographic algorithms like RSA because it's hard to get all the implementation details right.
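The averaging effect described in point 2 can be sketched with the standard library alone. The true value, noise scale, number of releases, and seed below are arbitrary choices for illustration; Laplace noise is sampled as the difference of two exponential draws, which is a standard way to generate it.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) sample: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

true_value = 42.0   # hypothetical statistic being protected
scale = 10.0        # e.g. sensitivity 1 with epsilon 0.1 per release
num_releases = 10_000

# Each release is noisy on its own...
releases = [true_value + laplace_noise(scale) for _ in range(num_releases)]

# ...but independent noise averages out across releases.
estimate = statistics.fmean(releases)
error = abs(estimate - true_value)
```

A single release has noise with standard deviation about 14 here, while the mean of 10,000 releases has standard deviation about 0.14, so the estimate lands very close to the true value. In differential-privacy terms, each independent release spends additional privacy budget, which is why repeated releases weaken the guarantee.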
Source: https://www.apple.com/iphone/compare/?modelList=iphone-13-mi...