Hacker News | new | past | comments | ask | show | jobs | submit | tzot's comments

Oh, yeah! And I believe there are not many games with Linux, Windows and macOS versions that allow cross-platform play. Several years ago we had one or two LAN parties with hardware running all three operating systems.

I tried this variant of JetBrains Mono and it had the perfect glyph width (reportedly −6%) for my screen and window sizes: NRK Mono Condensed from https://github.com/N-R-K/NRK-Mono. I also agree with almost all of the other modifications mentioned on the GitHub page under “Some notable changes are:”

Now I can have side-by-side two editors plus a Structure or Project pane at the left in PyCharm while having 120 chars visible in both editors.


Neat! I'll check it out.

Well, we can use memoryview when building the dict, avoiding the creation of string objects until it's time for output:

    import re, operator

    def count_words(filename):
        with open(filename, 'rb') as fp:
            data = memoryview(fp.read())
        word_counts = {}
        # bytes patterns accept any buffer, so the slices stay zero-copy views
        for match in re.finditer(br'\S+', data):
            word = data[match.start():match.end()]
            try:
                word_counts[word] += 1
            except KeyError:
                word_counts[word] = 1
        word_counts = sorted(word_counts.items(),
                             key=operator.itemgetter(1), reverse=True)
        for word, count in word_counts:
            print(word.tobytes().decode(), count)
We could also use `mmap.mmap`.

This doesn't do the same thing though, since it's not Unicode aware.

    >>> 'x\u2009   a'.split()
    ['x', 'a']
    # incorrect; in bytes mode, `\S` doesn't know about unicode whitespace
    >>> list(re.finditer(br'\S+', 'x\u2009   a'.encode()))
    [<re.Match object; span=(0, 4), match=b'x\xe2\x80\x89'>, <re.Match object; span=(7, 8), match=b'a'>]
    # correct, in unicode mode
    >>> list(re.finditer(r'\S+', 'x\u2009   a'))
    [<re.Match object; span=(0, 1), match='x'>, <re.Match object; span=(5, 6), match='a'>]

OP's .split_ascii() doesn't handle U+2009 either.
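For reference, a version with the right Unicode semantics (a sketch, not OP's code; `count_words_unicode` is my naming): decode once and match in text mode, trading away the zero-copy trick:

```python
import re
from collections import Counter

def count_words_unicode(data: bytes) -> Counter:
    # in text mode, \S excludes all Unicode whitespace (U+2009 included)
    text = data.decode('utf-8')
    return Counter(m.group() for m in re.finditer(r'\S+', text))
```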

edit: OP's fully native C++ version using Pystd


Hmm? Which code are you looking at?

There's bound to be a way to turn a stream of bytes into a stream of Unicode code points (which is, I think, what Python does for strings anyway). Though I'm explicitly not volunteering to write the code for it.

    import mmap, codecs
    from collections import Counter

    def word_count(filepath):
        freq = Counter()
        decode = codecs.getincrementaldecoder('utf-8')().decode
        tail = ''
        with open(filepath, 'rb') as f, \
             mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for chunk in iter(lambda: mm.read(65536), b''):
                text = tail + decode(chunk)
                words = text.split()
                # a word may straddle the 64 KiB chunk boundary; hold back
                # the last token unless the chunk ended in whitespace
                if words and not text[-1].isspace():
                    tail = words.pop()
                else:
                    tail = ''
                freq.update(words)
            freq.update((tail + decode(b'', final=True)).split())
        return freq

Oh that's neat, though I might split this into two functions in most cases; no need to entangle opening the file with counting the words in a file-like object.

That's two neat tricks that I'm definitely adding to my bag of python trickery.
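Something like this, I'd guess (names are mine): the counting half takes any iterable of strings, and going line by line also sidesteps words straddling a chunk boundary, since a word can't span a newline:

```python
from collections import Counter

def count_words(lines):
    # works on any iterable of str: an open text file, a list, a generator
    freq = Counter()
    for line in lines:
        freq.update(line.split())
    return freq

def count_words_in_file(filepath):
    with open(filepath, encoding='utf-8') as f:
        return count_words(f)
```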


Sure, but making one string from the file contents is surely much better than having a separate string per word in the original data.

... Ah, but I suppose the existing code hasn't avoided that anyway. (It's also creating regex match objects, but those get disposed each time through the loop.) I don't know that there's really a way around that. Given the file is barely a KB, I rather doubt that the illustrated techniques are going to move the needle.

In fact, it looks as though the entire data structure (whether a dict, Counter, etc.) should be a relatively small part of the total reported memory usage. The rest seems to be internal Python stuff.
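If anyone wants to put a number on it, `tracemalloc` can report the delta of traced allocations around building the structure (a rough sketch; `measure_counts` is my naming):

```python
import tracemalloc
from collections import Counter

def measure_counts(words):
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    counts = Counter(words)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # the delta approximates what the Counter itself accounts for
    return counts, after - before
```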


I dislike loading files into memory entirely, in fact I consider avoiding that one of the few interesting problems here (the other problem being the issue of counting words in a stream of bytes, without converting the whole thing to a string).

If you don't care about efficiency you can just do len(set(text.split())), but that's barely worth making a function for.


For reasons I never quite understood python has a collections.Counter for the purpose of counting things. It's a bit cleaner.
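For comparison, the try/except KeyError dance upthread collapses to a one-liner with it (a sketch):

```python
from collections import Counter

def count_words(text: str) -> Counter:
    # Counter handles missing keys; no try/except KeyError needed
    return Counter(text.split())
```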

> It's a bit cleaner.

That's pretty much the reason why. Raymond Hettinger explains the philosophy well while discussing the `random` standard library module: https://www.youtube.com/watch?v=Uwuv05aZ6ug

I feel like much of this has been forgotten of late, though. From what I've seen, it's really quite hard to get anything added to the standard library unless you're a core dev who's sufficiently well liked among the other core devs, in which case you can pretty much just do it. Everyone else will (understandably) be put through a PhD thesis defense, then asked to try the idea out as a PyPI package first (and somehow also popularize the package), and then, if it somehow catches on that way, get declined anyway because it's easy for everyone to just get it from PyPI (see e.g. Requests).

I personally was directed to PyPI once when I was proposing new methods for the builtin `str`. Where the entire point was not to have to import or instantiate anything.


For the simplest case of a single job, I use the job number (`[1]` in the example) with %-notation for background jobs in kill (which is typically a shell builtin):

    $ cat
    ^Z[1] + Stopped                    cat
    $ kill %1

On scripts that might handle filenames with spaces, I include:

    IFS='   ''
    '
Hint: the whitespace between the first two apostrophes is actually a single <Tab>.

This does not affect the already-written script (you don't need to press Tab instead of Space to separate commands and arguments in the script itself), but making <Tab> and <LF> the “internal field separators” allows globbing with fewer quoting worries while still allowing `files=$(ls)` constructs.

Example:

    IFS='   ''
    '
    echo hello >/tmp/"some_unique_prefix in tmp"
    cat /tmp/some_unique_prefix*
    fn="My CV.txt"
    echo "I'm alive" >/tmp/$fn
    cat /tmp/$fn
Of course this will still fail if there happens to be a filename with <Tab> in it.

I can't reproduce the glob expansion problem.

    % echo test >'/tmp/hello world'
    % cat /tmp/hello*
    test
This is bash 5.3.9.

Still, I couldn't agree more on limiting IFS. Personally, I set it only to <LF>.

In my scripts, I rely on the $(ls) idiom heavily. People I've talked to consider this an anti-pattern and suggest relying on -0, -z, --zero, --null, and -print0 flags instead. I don't deny that it's better than nothing when correctness is the goal, but I’d counter that shell is more about using a familiar interface (text representation) to solve new tasks, not about writing correct code (that’s the domain of other languages). An uncritical pursuit of correctness often results in convoluted code.

(I know that $(ls) is subject to various expansions. I solve this problem by using a shell that doesn't do that [1].)

Another consideration is that /bin/ls and /bin/find are not the only sources of filenames. Sometimes the source is third-party or has to be user-friendly (and thus separated by traditional newlines).

Some typographical issues just can't be solved by a pursuit of mechanistic correctness. For another example, the \.txt$ idiom wouldn't work if spaces are allowed at the end of filenames. Those problems are not even shell-specific.

Those are just a few of my personal notes. Fortunately, there's a more systematic and comprehensive study of this issue [2].

[1]: https://9p.io/sys/doc/rc.html

[2]: https://dwheeler.com/essays/fixing-unix-linux-filenames.html


> # f2 is ^[OQ; to double check, run `xargs` and then press f2

I remember using `cat -v` before learning that `xargs` exists… or maybe before `xargs` actually existed on systems I used :)


Maybe you understood image as in photo-image instead of image as in memory-image (like disk-image); a glorified memory dump, more-or-less.


I understood it as the latter.


I had this issue too, so I remapped Ctrl-W/Shift-Ctrl-W to Ctrl-\/Shift-Ctrl-\ . (Also git operations became two-key sequences, starting with Ctrl-G and that damn Ctrl-K stopped being the shortcut for commit.)


A from-scratch implementation of a Python 3.12 VM in x86-64 assembly.


I have always used em-dashes with specific spacing:

1. replacing parentheses —given that paired em-dashes, for me, mark content more relevant to the main text than a parenthesized expression would— so I use the same spacing as `()`

2. replacing a colon, or just finishing the sentence with a subsentence— so the spacing goes like for a colon.

Probably unfounded grammatically and against any style guides, but this spacing makes sense to me.


I am pretty sure an em-dash in case 2 should not have spaces on either side.

