Hacker News: llambda's comments

They could and probably should offer semantic search, which would be far more powerful than searching exact match keywords.

If you could identify podcasts that often talk about a domain more broadly, you'll have a higher hit rate and overall a better audience fit.


I’ve been creating a semantic search using embeddings tonight against my own podcast transcripts. I’d be happy to have my own content surfacing mechanism like this!
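
For anyone curious what the ranking step looks like, here is a minimal sketch. It assumes a toy `embed` function (a bag-of-words count vector) standing in for a real embedding model, so it only shows the mechanics of scoring by cosine similarity, not true semantic matching. The transcript snippets and the `search` helper are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Hypothetical stand-in for a real embedding model: a bag-of-words
    # count vector. A real setup would call a sentence-embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse vectors stored as Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, transcripts: dict[str, str], top_k: int = 3) -> list[tuple[str, float]]:
    # Score every transcript against the query and return the best matches.
    q = embed(query)
    scored = [(name, cosine(q, embed(text))) for name, text in transcripts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up transcript snippets, purely for illustration.
TRANSCRIPTS = {
    "ep1": "we talk about rust memory safety and the borrow checker",
    "ep2": "an interview about sourdough baking and fermentation",
    "ep3": "how the rust compiler handles lifetimes and memory",
}

if __name__ == "__main__":
    for name, score in search("rust memory", TRANSCRIPTS, top_k=2):
        print(name, round(score, 3))
```

Swapping `embed` for a real model is what would let a query match transcripts that never use the query's exact words.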


I couldn't agree more.

The fact is, unless these models are not what they promise to be, there's no way of separating generated content from human-written content without meaningful false positives.

On the other hand, if the models are in some way deterministic, then it stands to reason there's some plausible method of detection.

But if that's the case then there's a tremendous flaw in the foundation of these techniques...they aren't at all what's been promised.

So far that doesn't seem to be the case.

Finally it's worth pointing out that Google et al really shouldn't care if content was generated by a machine or not. The problem isn't content creation, it's content quality. So what Google actually cares about is delivering a result that meets a certain quality threshold. It's very difficult to believe Google cares much about the process of content creation.


How does this compare to Hop[0]?

[0] https://github.com/phaazon/hop.nvim


Just from skimming the docs, it looks like Leap's target keystrokes are more directly tied to the buffer's text (2 char prefix + optional discriminator), while Hop generates arbitrary labels... I can imagine the former feeling much more fluid, but haven't actually used either.


Hop user here, and have just given leap a quick test.

The difference is that leap displays all labels needed immediately after the first search key has been typed, whereas hop (more typical of jump plugins) progressively changes them as you type (depending on which hop command you're using).

The idea is that this reduces the tiny delay while you identify the label on the jump target you're looking at. Say you're searching for one 'function' amongst many. With leap (default shortcuts), the moment you've typed 'sf', everything you need to jump to your target is immediately displayed. With hop, after invoking the command (with a binding to, say, HopChar1), each search-narrowing keypress creates new labels which you then have to identify and type.

Seems promising at first glance, but only more usage time will tell if it's worthwhile.


I actually use both (though I use hop more often). I use hop to find a specific line, but leap (or lightspeed) to jump to a specific character.


I use hop only for specific characters using `HopChar1`.


To address a point the author makes: I’m entirely unconvinced the “shift left” mentality of data democracy (aka business operators should write SQL) is actually shifting left or a worthy path to pursue for most businesses. More recently this 2010s fad seems to be dying, and in its place we’re seeing centralized data efforts that produce data products.

One of the most significant pitfalls of data is failing to interrogate the value it provides and assuming that if you give everyone access all the time the magic will happen. The truth is value does not simply materialize, just as value does not magically spring from a computer because a human powers it on (okay sure, you may have already automated the value, but that’s actually the point I’m about to make). In both cases it requires an experienced practitioner who collaborates with a larger team to intersect their work with the business needs.

Data is tricky, all the more so because it’s often seen as a panacea by business leaders who aren’t connected with the work of extracting that value.


With all credit due to Google's excellent and under-appreciated paper Machine Learning: The High Interest Credit Card of Technical Debt [1], I submit that Big Data is the high interest home equity line of credit of business operations debt.

It's not that big data tools aren't useful. It's that, when you just start amassing huge piles of data without a clear up-front plan for how it will be used, and assume that a whole bunch of people who have never heard of sampling bias or multiple comparisons bias or Coase's Law [2] can figure out what to do with it later, you're setting yourself up for a Bad Time.

  1: https://research.google/pubs/pub43146/ 
  2: "If you torture the data long enough, it will confess."
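
To put a rough number on the multiple comparisons problem, here is a small sketch. It assumes the tests are independent and that the data is pure noise, which real dashboards rarely satisfy exactly; the function name is my own.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent tests
    on pure noise comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(chance_of_false_positive(n), 3))
```

Under those assumptions, ten exploratory comparisons already give about a 40% chance of at least one spurious finding, and a hundred make it close to certain.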


I'd say that Big Data is the Collateralized Debt Obligations of business operations. It looks fabulous from afar but it can blow things up quickly if there's no understanding of the internals.


Yet we abide by data-oriented conclusions outside of software engineering all the time, from academic papers to the FDA to crime statistics.


I won't say any of those are perfect. But there's at least a little more effort toward responsible data analysis in academia. The FDA brings an interesting example to mind. Take a look at how, on paper, drugs suddenly magically became less effective when the FDA started requiring clinical trial pre-registration in 2007.

It's also worth noting that, over the past few decades, most academic fields have been getting increasingly skeptical of the value of correlative research on pre-existing data sets. Even among people who have been extensively trained in how to do it properly. And yet, the vast majority of big data business plans I've seen in practice boil down to "collect a huge data set and then let people do correlative research on it."


Agreed, I want more scrutiny than some entity flashing “Here is the data”. It can easily be exploited behind the veneer of data-based-credibility.


> I submit that Big Data is the high interest home equity line of credit of business operations debt.

I like this but it's kinda like the payday loan of business operations.


>that if you give everyone access all the time the magic will happen

There's much ongoing discussion about this in the data world, often revolving around "self-service analytics".

Unless you're talking about "our analysts don't have to clean data all the time", which, for a large enough organization makes sense, "self-service" for non-technical folks is futile and pointless. They need specific answers to specific questions, not the ability to infinitely explore the data. Organizations should desire that kind of focus, not prevent it.


The idea was that they were going to hire an army of data scientists and become Google...magically.

Reality smacked that shit down hard. I left data engineering because the projects were all over the place, wildly undisciplined and unfocused.

You were lucky to have source control let alone an understanding from the business that these projects were in fact software development.

I switched back to software engineering because at least there is a faint realization that we are...building software.

I might go back when the dust clears.

"Why do we need to hire programmers...I thought we needed data engineers?"

"Because the data pipelines are all built with thousands of lines of code. Java, python, Fortran, you name it...and your job post only mentioned SQL and data modelling"

I could go on forever.


This is the constant argument I have with people about data products.

You don't need to expose more dimensions or get the users more access to the raw data. You need to understand what their business is and what their business problems are and help them answer those specific questions quickly and succinctly.

Yes, there are certainly times where people use huge amounts of raw data to uncover the answer to a question they didn't know they had. But it's rare, it's expensive to support, and most businesses aren't going to be able to do anything with it anyway (a whole org built to do X isn't suddenly going to shift to do Y because you discovered some insight in a random report).


I've seen data errors because of joins and aggregations. Data democratization can be a net negative, especially if people don't question the graphs they see.
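
A typical version of this is join fan-out: a one-to-many join silently repeats rows before the aggregation runs. Here is a minimal sketch using an in-memory SQLite database; the `orders` and `shipments` tables and their values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- Order 1 shipped in two parcels, so it matches two shipment rows.
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Correct total: sum the orders table on its own.
true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# The join repeats order 1 once per shipment, so its amount is summed twice.
joined_revenue = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]

if __name__ == "__main__":
    print("true revenue:  ", true_revenue)
    print("joined revenue:", joined_revenue)
```

Both queries look reasonable in isolation, which is why the inflated number can end up on a graph without anyone questioning it.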


Do you know what the author means by 'left' here? Probably not moving bits around in a way that is equivalent to multiplying by powers of 2?


It's a great question: fundamentally the Parquet format offers columnar orientation. With datasets like these, there's some research[0] indicating this is a preferable way of storing and querying WARC.

DuckDB, like SQLite, is serverless. Duck has a leg up on SQLite though when it comes to Parquet: Parquet is supported directly in Duck and this makes dealing with these datasets a breeze.

[0] https://www.researchgate.net/figure/Comparing-WARC-CDX-Parqu...


I'm disappointed to see there isn't Redshift support. What's on the roadmap to address that?


Hi, it's definitely in our plans for the next few weeks. As I mentioned in the post, we leverage dbt for the whole data monitoring layer. We wrote the package with all the cross-db best practices, but didn't test it on Redshift, so there are probably some gaps. There are a few Redshift users in our Slack community, and they will help us test in production with real data. Hopefully it will not require much to make it run smoothly.


Pathstream | REMOTE or San Francisco, CA | https://pathstream.com/

Pathstream gives people the path to a good career.

We're building the technology platform to help folks up-skill and succeed in the modern digital economy.

We're hiring for staff and fullstack engineering roles.

Reach out to me directly: maxc@pathstream.com or apply via Greenhouse.

Staff Software Engineer: https://boards.greenhouse.io/pathstream/jobs/4570108003

Fullstack Software Engineer: https://boards.greenhouse.io/pathstream/jobs/4174105003


Curology | Engineering Manager, Platform Engineer, Data Engineer, and many others! | San Francisco, CA | Full-Time, Onsite

Engineering Manager

Curology's Platform Engineering Team is looking for a passionate, experienced, and creative data engineering manager to lead, grow, and mentor the Data Platform Team. As an Engineering Manager of the Data Platform Team, you will work closely across teams to design and implement the pipelines and infrastructure that power our data science, business insights, and marketing efforts. The perfect candidate will have strong data infrastructure and data architecture skills, a proven track record of leading and scaling engineering teams, strong operational skills to drive efficiency and speed, strong project management leadership, and a strong vision for how data can proactively improve companies. This is a full-time position based in our San Francisco office.

More roles on our careers page: https://curology.com/careers

Questions? Please reach out to derrick@curology.com or max@curology.com!


Curology | San Francisco, CA | DevOps and Data Engineers | $130k-$175k | FULL-TIME | ONSITE | https://curology.com

ABOUT US: Curology is a telemedicine startup focused on making dermatology accessible to everyone. We're growing quickly—300% in the past year alone—and creating excitement by helping real people see life-changing improvements in their skin.

HIRING: We're looking for talented devops and data engineers to lead development of our infrastructure and data engineering efforts. There are a lot of interesting and challenging problems to solve for both roles--scaling and privacy are top-of-mind.

BENEFITS: Relocation, competitive salary, meaningful equity, free food, health insurance, open vacation policy

These are two separate roles and the details of each can be found on AngelList:

DevOps - https://angel.co/curology/jobs/424386-devops-engineer

Data Engineer - https://angel.co/curology/jobs/426336-data-engineer

Apply via AngelList or shoot me an email: max+hn@curology.com (please include "Who is hiring" in your subject line).


SEEKING WORK - San Francisco or Remote

I am a polyglot, full-stack developer with ~10 years of experience. My specialties are Python, Flask, Clojure, Node, and React, and I have professional experience with MongoDB and Postgres. I am reliable, easy to work with, quick to turn things around, and a good communicator. I am available to work either on my own or as part of your team. Client satisfaction is my top priority.

Some of my open source work is available here:

https://github.com/maxcountryman

I can also provide work samples and references upon request.

Please do not hesitate to reach out: maxc@me.com

