Hacker News | redsaz's comments

Thankfully the episode's interpretation of a Good Samaritan Law is completely wrong, at least in many (most?) US states. They protect citizens acting in good faith to help somebody, not like it was interpreted in the episode to require somebody to help.


Not completely wrong. There are duty to rescue or duty to report laws, and they too are very commonly referred to as Good Samaritan laws even if that more commonly refers to a different concept of liability.

https://www2.law.ucla.edu/Volokh/rescue.htm

Duty to report laws are on the books, including in Massachusetts (ch. 268 §40), requiring bystanders to report crimes, but not necessarily to render aid. Duty to report laws are rarely enforced, which is what makes the Seinfeld setup more ironic. Although MA has one of the broadest duty to report laws (covering armed robbery and manslaughter in addition to sexual assault), the max penalty is a $2500 fine; other states have jail time penalties. Duty to rescue almost always applies only to special relationships, such as parent/guardian/child, spouse, employer/employee, passenger, not random bystanders.

https://www.findlaw.com/legalblogs/law-and-life/in-which-sta...


How should you conduct the interview, then, if:

... Classic interviewing techniques of "explain how X algo works" or "write code to solve Y" will unfairly bias against interviewees that don't test well under pressure, but would otherwise be a good coworker.

... "Teach me something interesting to you" or "tell me stories about past experiences" will unfairly bias against interviewees that are shy, soft spoken, or are mildly socially awkward, but would otherwise be a good coworker.

Given the above awareness of where bias can emerge, how should the interview be done in order to get a candidate that knows what they're doing, and works well with the rest of the team?

Other comments mention relying more on recruiters and referrals, but that isn't always an option.


I don't see what useful signals are supposed to come out of an interview. If referrals and recruiters aren't an option I'd probably try to skip the interview altogether and go with a long probation period (3-6 months). Or possibly have a short 20 minute free-form interview talking about their last day job and expectations of the new one with a very short list of major red flags ("doesn't want the job", "unable to form sentences") then block candidates who raise them.


- How do they take a business problem and model it into code

- How do they debug their own code

- Is their code easy to read

- Do they name their variables/fields/methods/classes in easy to understand and consistent ways, or are the names confusing or inaccurate

- How do they take constructive criticism

- How collaborative are they

- Do they think about the problem first or do they just start hacking away

- When asked to add a feature to existing code, do they start hacking or do they write out a test describing the new functionality first

- When confronted with vague requirements, how well do they ask questions to get the information they need

- How much experience do they have with algorithms, database design, systems design, building things so they scale well


If it were possible to work all that out in the interview then there wouldn't be any bad hires.

As a wishlist I like it, I just don't see how you're going to assess all that in an interview. You'll notice that the technique of the day ("teach me something") doesn't address any of the dot points and that holds for ... pretty much any technique. Interviews are a weak process for assessing anything.


The long probation approach completely ignores the very real costs of onboarding someone new.

It takes time, money and people to bring someone in, and hiring is actually quite a risk for many companies (unless they're huge and/or in a hiring frenzy).

If a candidate doesn't work out, that's a lot of time and money down the drain, and potentially lost work, and disruption to teams and timelines, etc..

Most people don't get this part, and I think that's why they don't understand why interview processes are structured the way they are.

You really want to do the best possible evaluation, on all fronts, at the start. The longer a bad candidate stays in your pipeline or company, the more expensive and disruptive it gets.


tl;dr: Neanderthals needed to go clubbing more.


Heaven forbid we encourage people to have fun. I personally like my jail cell and prison clothes.


> The only thing missing is a game

I've seen this story of "I want to make a game" [proceeds to make a game engine instead] happen in my own life (my engines were never any good or complete though), and in countless other programmers lives.

It may be the trap of thinking that "If I get the hard part out of the way first (which is writing the engine code, right? Right? Anyone?), then the rest of the game making process will be easy" that gets me.

Or maybe it's finding out along the way that it was more fun to make the engine than the game itself: "check it out, I completely redid the particle effects and I can now do 100x more particles at 60fps, how cool is that?"

There are way more easy-to-see improvements in the making of the engine than in the making of the game itself, and so we (ok, I) keep optimizing the engine, because those are quick dopamine payoffs compared to the slower payoff of having a polished game that's actually fun for the target audience to play. Sure, I might tell myself that the game I want is only possible once I have the engine first, so I'd better concentrate on that before making the actual game, and there's some logic to it. But without a clear idea of what the game will actually be, it's easy to fall into the trap of endlessly adding and refining features, rather than actually trying to use those features in anything beyond a slick demo.

To combat the tendency of only making an engine rather than a game in my latest hobby project, I picked an already existing engine (Phaser js) and tried to get something interactive on the screen ASAP "with the stupidest, least designed code possible", and it mostly worked to get me a playable (ish) game! Granted, it's a knockoff puzzle game but hey, I sometimes find myself "playtesting" it instead of what I should really be doing, which is refactoring the code for what I'd like to have it do next, so I'm marking it as a win.


Indeed, making an engine is a guaranteed reward. You put in time, make an engine that can draw something, and you succeed, receiving a dopamine reward. But making a game is really risky; a programmer may put in a lot of time and hope, then still fail. I think lots of programmers don't want to take the risk of failure.


> empirically measured that completing a task... is twice as fast in [Rust] than in [C++]

I have not read up on which tasks were empirically measured, apologies. The reason I'm curious what the tasks are is that, depending on the task, navigability may not matter.

For example, if the task is "build a tool that does X", then navigability of the code does not matter. Once built, the tool does X, and there's no reason to revisit the code, and thus no reason to navigate the code.

But if the task is "Given a tool that already does W, X, Y, make the tool also do X', Y', and Z", then navigability of the code matters. This is because the coder must understand what the tool already does, and where the changes need to be made.

Most of my professional life, (and I'm willing to bet, most other coders here as well) I more often find myself in the second task than the first.

But, I'm not interested in Rust vs C++. I'd be more interested in the results of "given a version that makes high use of type inference vs not, how quickly can someone new to the project add X', Y', and Z." That would be a more appropriate test for what the author describes here. And I'd imagine that probably, those that are using sufficiently advanced IDEs would beat out those without, regardless of if type inference used or not, and would probably be slightly faster when given the highly type-inferenced version.


> This post can definitely be considered a “religious” opinion piece

The author certainly has that right, because the post steps on two programming religion landmines, from how I read it:

1. strict static typing (without type inference) is good.

2. code should be written to allow IDEs to enhance navigability, rather than written on the assumption that IDEs will be the sole provider for navigability.

I believe there is a point to be made in the "when we don't know what we're getting back, that harms navigability" camp. But as another commenter posted, there's a point to be made in the "when we overspecify what we're getting back every time, that can harm readability, too" camp.
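To make the two camps concrete, here's a minimal Python sketch of the same call written both ways. (`load_config` is a made-up stand-in, not from any post or library being discussed.)

```python
def load_config(path: str):
    # stand-in for a real parser; returns a mapping of settings
    return {"source": path, "debug": "false"}

# "Don't know what we're getting back" camp's complaint: inferred and
# concise, but the reader must chase down load_config to learn the type.
config = load_config("app.toml")

# "Overspecified" camp's complaint: the type is spelled out at the call
# site, aiding navigation without an IDE, at the cost of repetition.
config_explicit: dict[str, str] = load_config("app.toml")
```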

I can't express where this balance is. It's somewhere between poetry and a legal document, the prose where you can really get into a good book and enjoy the world that the author presents. Some people really like the beauty of a short poem. Other people may require precise wording that leaves no room for enjoyment or interpretation. The rest of us can have the majority of fun somewhere in between.

Where that "in between" equivalent would be in my day-to-day programming, I'm not entirely sure, because what I'm writing could be a short script where brevity is vital (poetry-ish) vs some section of unfortunately highly complex code with lots of tests for edge cases (legalese), and all the other code where I'm still world-building and conveying ideas (prose). And I believe that complexity should be spelt out as precisely as it can in the code itself, rather than rely on the hope that somebody else is using the same IDEs and features as me. I've tried using type inference where it seems fine to use, and then spelling out the exact type that a variable wants where it isn't clear what might get returned, all in the same app, but it comes across as sloppily inconsistent in my mind. Ah well.


We have a similar saying: "You can lead a horse to water but you can't make 'em drink." At least, it seems similar to me.


Oh yeah, that's exactly what I mean. Thanks for reminding me of the English / American idiom.


Answer: because they don't want to be promoted to manager, they'd rather be coding.


that's just another myth spread by unskilled managers.


Writing a program like this is one of the first exercises I give myself when learning a new programming language, because it touches a little bit of everything (reading files, output, CLI, using libraries, hashmaps, functions, loops, conditionals, etc) and isn't too onerous to implement.

My latest (it's a few years old at this point) is lsdup (rust version) using blake3 for hashing the content: https://github.com/redsaz/lsdup/

All it does is list the groups of duplicate files, grouped by hash, groups ordered by size. I'll usually pipe the output to a file, then do whatever I want to the list, and run a different script to process the resulting list. It works fine enough.


Checking the file size first is good, but reading every byte of every file that has at least one other file of matching size, and doing a ton of XOR steps (or whatever BLAKE3 does) for all those bytes, can't be optimal when different files probably differ within the first few bytes.

If you only have two files, read them X bytes at a time in parallel and XOR the bytes directly, stopping at the first different byte. For more than 2 files, to a first approximation, you have your list of file pointers, you read the first X-number of bytes for all of them and then sort your list in-place based on those first X bytes (sorting is a linear operation if all the X bytes are the same), then you iterate over that sorted list processing each run of identical X bytes in a depth-first fashion. If you have just one element that starts with a given set of X bytes, it's unique and you don't have to process it anymore, otherwise repeat the process reading the next X bytes but only for the files that started with the same bytes. X is probably 8, so that you can efficiently XOR 64 bits together for your comparisons.
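The two-file case above might look something like this rough Python sketch (reading larger chunks rather than 8 bytes, since per-call overhead dominates at that size; this is an illustration, not anyone's actual tool):

```python
def files_equal(path_a, path_b, chunk=1 << 16):
    """Compare two files chunk by chunk, stopping at the first difference.

    For the common case of files that differ early, this reads only a few
    KiB instead of hashing both files end to end. Mismatched lengths also
    terminate early, when one side hits EOF before the other.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if a != b:
                return False   # first mismatching chunk: done early
            if not a:          # both hit EOF together: identical
                return True
```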


I've considered adding a "just check the first {user-configged-bytes}" mode, which would offer enough of the speedup you describe, which I think jdupes does (maybe? It's been a while since I looked at it). I think the speed of opening the file and reading the first buffered page of bytes would dominate the time of the operation, especially if one were to do async reads (which my program does NOT do. I should look into it.)

Worth some performance measurements, anyway!


(Warning: rambling ahead, since in the past I've spent a decent amount of time on the same problem)

> using blake3 for hashing the content

Using any hash algorithm isn't a good design, at least for SSD storage.

First, it invites some degree of cryptographic risk. i.e. if a collision is ever discovered in the hash then your program can be tricked into discovering a false duplicate. Whether that is a problem depends on the use case, but isn't ideal.

Worse, though, is that it just doesn't make any sense algorithmically. Consider the simple example of having two 1TB files and you want to discover if they're identical. You could do a cryptographic hash of each of them and (barring any malicious collisions) tell whether they're the same.

However, now imagine that those two files differed in the very first byte. Now it seems that you could have figured out they're different a lot faster, right?

So what you really want to do is read both files chunk-by-chunk (probably some number of disk blocks at a time) so you can detect the files-are-different case early. (After all, the common case of files that differ is that they'll differ early!). You could still compare the chunks using a cryptographic hash, but now there is no benefit: you can just compare the two blocks of memory directly faster than you can take a crypto hash of them. C's memcmp() works fine but since you are probably working on fixed-sized aligned blocks you can do slightly better with a hand-rolled SIMD loop.

The one advantage that a cryptographic hash gives you is that it provides a memory-efficient way of reading all of file A and then all of file B. Therefore if disk seeks are expensive, it can be a benefit (again, if you can accept the risk of a malicious false-positive). However if the files are SSD backed and you have enough RAM to read a decent sized chunk of each file into memory simultaneously this ceases to be a problem.

To extend this from 2 files to many, first stat() all of the files and group them by file size. After all, two files of different sizes aren't going to ever be equal. You can think of the size as a "hash" of the contents that you get for free. Any files that are 0 bytes are (of course) duplicates of each other so you can just return those as "hits". Any file that has a unique size (which thankfully is often the common case) is not a duplicate of anything and you don't even have to open it. If you care about hardlinks, you also want to track the inode numbers at this step so you can avoid comparing two files that are actually the same inode.
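That first stat()-only pass could be sketched roughly like this in Python (a simplification for illustration; symlink and permission handling are ignored):

```python
import os
from collections import defaultdict

def candidate_groups(paths):
    """Group files by size alone, via stat(), reading no file contents.

    Files with a unique size are ruled out for free. Hardlinked paths
    (same device + inode) are collapsed so the same underlying file is
    never compared against itself in later passes. The size-0 group, if
    any, needs no further reading at all: its members are duplicates.
    """
    by_size = defaultdict(dict)        # size -> {(dev, ino): one path}
    for p in paths:
        st = os.stat(p)
        by_size[st.st_size].setdefault((st.st_dev, st.st_ino), p)
    # only groups with two or more distinct inodes need content checks
    return [list(inodes.values())
            for inodes in by_size.values()
            if len(inodes) > 1]
```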

Then for any group of files with the same size, read each block in turn. The tricky part is that you want to subdivide the group by the contents of the file. i.e. if you have 4 same-sized files and two have contents "AAAA..." and the other two have contents "BBBB..." then you didn't find any unique files yet, but you need to split the set of 4 files into two new subsets of 2 files each. Data-structure wise, keep a worklist of items containing of (1) a set of (at least two) files that could be duplicates and (2) how many bytes you've already verified are the same among them. Then when you encounter this "split" scenario you can just push a new worklist item and continue working on one of the groups.

The bit you need to be careful of is not introducing bad worst-case performance here in a hard case (e.g. you have a million potentially-duplicate files, and reading one block separates them into 5000 groups each with 200 members). Just a decent hash-table is enough with the key being the whole disk block.

Because maintaining this hash table adds a bit of complication, it can seem worthwhile to build a special-case for groups of exactly two files, where you can just do the simple read-and-compare. But then you can re-combine this by observing that the case where you have N files that are all the same is also worth optimizing for. So instead, just do the read-and-compare on all of the files until you find the first two that are different -- only then start building the hashtable when you have two different blocks and more files still to read. That way the common-ish case where you have many files all the same can be handled as fast as possible.
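A rough Python rendering of that worklist-plus-hash-table scheme (simplified: it always builds the per-block buckets, omitting the two-file and all-identical fast paths described above):

```python
from collections import defaultdict

BLOCK = 4096  # bytes read per round; a real tool would use larger reads

def find_duplicates(files_same_size):
    """Block-by-block split of a group of same-sized candidate files.

    Each worklist item is (paths, offset): files that might be duplicates
    and how many bytes are already known identical among them. Reading
    one more block either keeps a group together, splits it via a hash
    table keyed on the raw block, or retires it at EOF as confirmed dupes.
    """
    work = [(list(files_same_size), 0)]
    dupes = []
    while work:
        paths, offset = work.pop()
        buckets = defaultdict(list)
        for p in paths:
            with open(p, "rb") as f:
                f.seek(offset)
                buckets[f.read(BLOCK)].append(p)
        for block, group in buckets.items():
            if len(group) < 2:
                continue                  # unique at this block: proven distinct
            if len(block) < BLOCK:
                dupes.append(group)       # EOF reached: identical throughout
            else:
                work.append((group, offset + BLOCK))
    return dupes
```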

There are things that operating systems could provide that would make this even better:

1. It would be nice if it were easy to estimate the likely seek cost before picking an algorithm. If the file system would simply indicate whether it thought it was backed by spinning rust that would be great.

2. Also if you could ask for the filesystem to read a non-fixed number of bytes (without resorting to async-I/O) that would be helpful. ("Give me up to 1MB of the file, but stop early if you have to seek to a new extent"). Having the ability to basically read the file extent-by-extent instead of block-by-block would mean we could be seek-efficient while reading multiple files in parallel.

3. Finally, it would be great if there was some portable way to access any block-hash metadata the filesystem itself keeps. A filesystem that does its own deduplication work might already know that two blocks must have different contents without reading them because it already scanned them. On the flip-side, if the filesystem supports copy-on-write file snapshots then it could tell us in advance that two inodes are really the same underlying file before we open them.


So, my deduplication is about merging various archives I have of various things I've pulled off of 4chan since 2005. My algorithm is a bit backward because it starts with the proposition that all files are the same. This will turn out to save on comparisons.

Differentiators start with file size difference (obv), then an MD5 of the first one percent of the file, then a SHA-1 of the first one percent of a file, then an MD5 of the first ten percent, and so on. The byte-by-byte comparison is the last-ditch effort. A differentiator is only triggered by having two or more files together in a subgroup, and the results are stored in a database.

So we start with a massive group of all of the files. Then subgroups are made by file size. Some subgroups might only have one member and so we stop there. If not, we start with the MD5 of the first one percent ...
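That ladder of differentiators could be sketched in Python something like this (an illustration of the approach, not the actual tool; the database step is omitted and the final byte-for-byte pass is left to the caller):

```python
import hashlib
import os
from collections import defaultdict

def refine(groups, fingerprint):
    """Split every candidate group by one differentiator, keeping only
    subgroups that still have two or more members."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[fingerprint(path)].append(path)
        out.extend(g for g in buckets.values() if len(g) > 1)
    return out

def prefix_hash(algo, fraction):
    """Differentiator: hash of the leading `fraction` of the file."""
    def fp(path):
        n = max(1, int(os.path.getsize(path) * fraction))
        with open(path, "rb") as f:
            return hashlib.new(algo, f.read(n)).hexdigest()
    return fp

def likely_dupes(paths):
    # cheapest differentiator first: size, then the partial-hash ladder
    groups = refine([list(paths)], os.path.getsize)
    for algo, frac in (("md5", 0.01), ("sha1", 0.01), ("md5", 0.10)):
        groups = refine(groups, prefix_hash(algo, frac))
    return groups  # still only candidates until a byte-by-byte pass
```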

I will probably work on image matching, eventually.

The other reason I made it was so that after the dupes were detected, I wanted custom rules as to what to do with them.

I know Microsoft had a metadata dream about files and while I don't disagree with it, most people just ... don't do it, or they do it inconsistently. I've worked with librarians, people who would agree on where to put a given book in a vast series of shelves, but when it comes to digital works, they get all sloppy. I think one of the better possible frontiers for AI is tagging out documents and images. But it's still quite a ways away. Just as an example, one would think that Microsoft would have a spellchecker for filenames by now.


I like the strategy of only hashing the first part of a file as a multiphase approach of deduping in order to quickly eliminate unique files, I wish I had done it that way with my util. Maybe for v2!


I think my next pass at it will be to merge the MD5 and SHA-1 steps so that we get two outputs at the end, this way I would save on the file-reading.

But after that, I think one percent of the end of the file. Then ten percent of the start of the file, then ten percent of the end of the file ...

Given that metadata is typically at the end or beginning of a file, those seem like the best place to look for differences.

I would be open to other hashes so long as they were drop-in easy. I'm not concerned about malicious, forced collisions because they would have to overcome two different kinds of hashing, and the most it would earn is a delay, since there's always a byte-by-byte comparison at the very end.

I suspect I would similarly want to use multiple fingerprinting methods for the visual characteristics of an image file.


> I'm not concerned about malicious, forced collisions

Consider taking a look at xxh3, possibly. It seems a pretty decent contender, hashing speed-wise, to trade off of secure hashing: https://github.com/Cyan4973/xxHash


This isn't the first time I've heard concerns about using hashes for checking file equality. I've considered adding a "paranoid mode" to do the direct byte-for-byte checks for such folks that don't even want to introduce a so-remote-it's-virtually-impossible theoretical chance for a collision to occur.

I'd go into the math about how remote of a chance it would be (barring any discovered hash collisions) but others have explained it better than me elsewhere.


"The math" only matters for random collisions, which are effectively impossible (less likely than the CPU malfunctioning). However that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, doesn't mean that they won't be found someday.

But as I tried to describe (probably in way too much detail) the real problem with "hash everything, compare hashes afterwards" is that it implies that you'll be doing I/O to read all of the file's contents even when it isn't needed to prove uniqueness. For a lot of common use cases (big files, few dupes) this can mean doing 1000x more I/O than you need.

Once you design the solution around avoiding unneeded I/O, you find that hashing also stops being useful.


> that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, doesn't mean that they won't be found someday.

This is what I meant by "barring any discovered hash collisions" but in retrospect I didn't make that clear enough.

Though, if you're crafting your own malicious different-content-same-size files and storing them in your NAS to cause a hash collision to make them appear the same, then I bet several governments are willing to pay top dollar for your abilities :D

Or, different scenario: say you're hosting a Dropbox-like service and you're storing files for hundreds of thousands of users. Then you shouldn't be using a duplicate-file-finding util anyway; deduplication would be better implemented at a different layer.

The scenario you describe (lots of big files of same sizes, few dupes) I agree hashing the entire file would be wasteful. From my experience on my file server, when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content.

Put another way, if multiple files of the same sufficiently-large size are encountered, expect to read the entirety of those files anyway, whether hashing or checking byte-for-byte, because they are likely dupes. So, there's still potential for perf gains by avoiding hashing, but I'm willing to bet it isn't as much as one would hope/expect.

(You do have me curious as to how much difference it could make, though)

Edit: I'm also willing to admit that I have so many dupes because my backup strategy is TRASH and I have dupes everywhere, and so my scenario could be more unusual than other people.


> when I had two or more files of the same size, and the size was larger than a few MB, they likely had the same content.

Yes, if the number of files is small enough, then "notice unique file sizes" is really the only optimization that ends up mattering much. If you have a few thousand files and they're each multiple gigabytes then hopefully you'll get lucky and no two will have the same size.

But the ideal tool should also try to handle the opposite case well too.

First, imagine if you have a huge collection of unique ~100KiB files. Now the "birthday paradox" means that size collisions are inevitable, so optimizing the total I/O needed to prove two files are different starts to help.

But the pathological case is even worse -- what if nearly all of your files are about the same size? For instance, suppose you have a program that is recording time-series data from a sensor and rotating the file every time it grows to 10MB. This sort of thing happens all the time dealing with scientific data sets -- you might have a directory with thousands of large files, all exactly the same size. If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.


> But the ideal tool should also try to handle the opposite case well too... a huge collection of unique ~100KiB files.

and

> But the pathological case is even worse -- what if nearly all of your files are about the same size?... If you want to quickly verify none of the files are dups, reading one block from each is far more efficient than hashing them all.

I agree this is a worthy consideration. No sense in reading the entirety of each of those files when only reading the first block would do, in order to remove the uniques early. If I were to redo the util, it'd probably be something like:

1. Group all files into lists of same file sizes.

2. Eliminate any groupings with just one file; these are unique files.

3. Read first N bytes to pare down files in those lists, (so now the key is filesize and hash-of-first-N-bytes or even filesize and first-N-bytes if N is small enough, either way)

4. After each filesize-group is subgrouped by first-N-bytes evaluation, eliminate any subgroupings with just one file, these are unique.

5. What remains are files fairly likely to be duplicates.

5a. For users that consider this "good enough", allow them to stop here. (Some deduper tools do this)

5b. For everybody else, in order to make sure the files are dupes or not, the files can next be subgrouped by fully byte-for-byte comparison and/or hashed, whatever the user is good with.

6. The remaining groupings of two or more files are dupes.
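A rough Python sketch of that plan, steps 1 through 6 (not actual lsdup code; the prefix length N and the in-memory full-content comparison in step 5b are simplifications for illustration):

```python
import os
from collections import defaultdict
from pathlib import Path

N = 4096  # bytes of prefix to read in step 3

def dedupe(paths):
    """Steps 1-6: group by size, subgroup by the first N raw bytes,
    then confirm remaining candidates with a full byte-for-byte pass."""
    # steps 1-2: size groups; singletons are unique and dropped
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    # steps 3-4: subgroup by (filesize, first-N-bytes); singletons dropped
    likely = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            with open(p, "rb") as f:
                likely[(size, f.read(N))].append(p)
    # steps 5-6: confirm byte-for-byte (whole contents; fine for modest files)
    confirmed = []
    for group in likely.values():
        if len(group) < 2:
            continue
        by_content = defaultdict(list)
        for p in group:
            by_content[Path(p).read_bytes()].append(p)
        confirmed.extend(g for g in by_content.values() if len(g) > 1)
    return confirmed
```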

In the end I opted not to go for this rewrite, at least not yet, because I got sidetracked thinking about how the whole reason I'm doing this in the first place is because the way I've backed up data across all my machines for years is pretty horrible, all things considered, and now I've got my wife's data to consider too, and I still want my data to be locally available on my laptop, and I don't want to entirely rely on cloud services for syncing, and, and, and... so now I'm making a tool for all that. And then when it is finished, somebody can come along and say "you could've just used owncloud and syncthing and rclone and a pair of external drives, good grief man."

Still though, I might rework the deduper logic anyway.


I've not heard that version before, I like it. The way I usually hear it end:

So, he yells out "102!" and... Crickets.

"What'd I do wrong?"

"Ehh, you must not have told it right."

(...Or, in this case, "you're probably not using the right model GPU")

