
olsgaarddk · on Dec 3, 2020

According to Kenneth G. Henshall, who wrote a book on the etymology of kanji [1], it is quite common to choose a phonetic part that lends itself to the meaning of the word and not just the sound.

[1] https://www.amazon.com/Guide-Remembering-Japanese-Characters...

olsgaarddk · on Dec 3, 2020

This is a really good summary.

For anyone who wants to geek out on this further, I highly recommend the introduction in "Decoding Kanji: A Practical Approach to Learning Look-alike Characters" by Yaeko Sato Habein [1]. Despite being a workbook for intermediate students, the introduction really goes all-in on this stuff. It treats the topic very academically, but it is also a treasure trove of kanji-related "fun facts", and the appendix has interesting compilations such as "Pairs of homonymous kanji compounds with one kanji in common".

[1] https://www.amazon.co.uk/Decoding-Kanji-Practical-Look-alike...

olsgaarddk · on April 21, 2020

I am interested and would like to learn more. So do you just

    sudo apt install pypy3

and then

    pypy3 -m pytest /my/python/app

and if things go well you either got a 5-10x speed up or an insufficient test suite?

olsgaarddk · on March 14, 2020

Kinda off-topic, since the article is about not making errors, instead of logging them, but sometimes you are not in control of input, and you expect some exceptions that you want to log.

In Python, instead of the example in the blog:

    try:
        do_something_complex(*args, **kwargs)
    except MyException as ex:
        logger.log(ex)

Use the almost undocumented `exc_info=True`, like this:

    try:
        do_something_complex(*args, **kwargs)
    except MyException as ex:
        logger.error(ex, exc_info=True)

This will log the entire exception, including the traceback.

Even cooler, every uncaught exception calls `sys.excepthook()`. The function is supposed to be monkey patched by anyone who wants to do something with uncaught exceptions, so you can do the following if you want to log all uncaught exceptions:

    def exception_logger_hook(exc_type, exc_value, exc_traceback):
        logger.error("Uncaught exception", exc_info=(exc_type, exc_value, exc_traceback))
        sys.__excepthook__(exc_type, exc_value, exc_traceback)

    sys.excepthook = exception_logger_hook

This will send all uncaught exceptions to the logger and then hand them to the default hook, which prints the traceback to stderr as usual.
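
For anyone who wants to try the hook without crashing a real program, here is a minimal self-contained sketch. The logger name `excepthook_demo` and the in-memory `log_stream` are my own assumptions for the demo, and the hook is called by hand to simulate what the interpreter does for an uncaught exception.

```python
import io
import logging
import sys

# Hypothetical logger name and in-memory stream, used only for this sketch.
log_stream = io.StringIO()
logger = logging.getLogger("excepthook_demo")
logger.setLevel(logging.ERROR)
logger.addHandler(logging.StreamHandler(log_stream))
logger.propagate = False


def exception_logger_hook(exc_type, exc_value, exc_traceback):
    # Log the exception with its traceback, then defer to the default hook,
    # which prints the traceback to stderr.
    logger.error("Uncaught exception", exc_info=(exc_type, exc_value, exc_traceback))
    sys.__excepthook__(exc_type, exc_value, exc_traceback)


sys.excepthook = exception_logger_hook

# Simulate an uncaught exception by calling the installed hook directly
# with the current exception info.
try:
    raise ValueError("bad input")
except ValueError:
    sys.excepthook(*sys.exc_info())

logged = log_stream.getvalue()
```

After this runs, `logged` should hold the message followed by the formatted traceback, and the same traceback is also printed to stderr by the default hook.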

teddyh · on March 14, 2020

> Use the almost undocumented `exc_info=True`

Why not simply use logger.exception()?

  try:
      do_thing()
  except Exception:
      logger.exception("It broke.")

Called from within an exception handler, logger.exception() automatically includes exception information.
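
A small sketch of that behaviour, assuming a throwaway logger named `exception_demo` that writes to an in-memory stream so the result can be inspected (both names are made up for this example):

```python
import io
import logging

# Hypothetical logger and stream names, chosen only for this demo.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger = logging.getLogger("exception_demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False


def do_thing():
    # Stand-in for real work; always fails with ZeroDivisionError.
    return 1 / 0


try:
    do_thing()
except Exception:
    # Logs at ERROR level and appends the current traceback.
    logger.exception("It broke.")

output = log_stream.getvalue()
```

The record is emitted at ERROR level, so `logger.exception(msg)` behaves like `logger.error(msg, exc_info=True)`.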

olsgaarddk · on March 14, 2020

Mostly because I didn't know of it. But `exc_info=True` is also useful if you want to log caught exceptions at debug level, because you were expecting the exception. E.g., a StopIteration exception that you for some reason want information about, or an object where you need to check whether it has a certain key, but which for some reason doesn't have a get-method.

I'm sure there are also good reasons :)
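
The debug-level case could look roughly like this. It is only a sketch: the `lookup` helper, the `NoGet` class and the logger name are invented for the example.

```python
import io
import logging

# Hypothetical logger and stream names, chosen only for this demo.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger = logging.getLogger("debug_demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False


class NoGet:
    """Made-up container that supports obj[key] but has no get() method."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]


def lookup(container, key, default=None):
    """Return container[key], or default when the key is missing."""
    try:
        return container[key]
    except KeyError:
        # Expected failure: record it with traceback, but only at DEBUG level.
        logger.debug("Key %r missing, using default", key, exc_info=True)
        return default


value = lookup(NoGet({"a": 1}), "b", default=0)
output = log_stream.getvalue()
```

Because the record is at DEBUG level, it disappears as soon as the logger is configured for INFO or higher, which is the point of not using `logger.exception()` here.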

sten · on March 15, 2020

TIL, thank you sir. Gonna try this out tomorrow.

olsgaarddk · on Oct 20, 2019

During my graduate degree, I was looking into phonetic search, and a lot of the papers I stumbled across were using an expanded version of Soundex called Phonix [1] as the basis for new algorithms. However, I had a lot of trouble finding an implementation I could use, so I implemented it in Python.

After that I built a search engine programme that would look up swear words that were phonetically similar to input terms, as I thought that my fellow students might get a laugh looking up their own names. In the end, neither Phonix, double metaphone nor Soundex really produced any funny results.
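
For context, this is roughly what plain Soundex does. It is a minimal sketch of the classic American Soundex rules, not Phonix and not the code from my repo; the function name `soundex` is just an illustrative choice.

```python
# Consonant groups used by classic American Soundex.
_SOUNDEX_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for _letter in _letters:
        _SOUNDEX_CODES[_letter] = _digit


def soundex(name):
    """Return the 4-character Soundex code of name, or "" if it has no ASCII letters."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            # H and W are skipped and do not separate equal codes.
            continue
        code = _SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            result += code
        # Vowels have no code, so they reset `previous` and allow repeats.
        previous = code
    # Pad with zeros and truncate to one letter plus three digits.
    return (result + "000")[:4]


codes = {name: soundex(name) for name in ("Robert", "Rupert", "Ashcraft", "Tymczak")}
```

Names that sound alike collapse to the same code, which is what makes it usable as a lookup key for phonetic search.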

plugs:

- Blogpost: http://olsgaard.dk/phonixpy-phonetic-name-search-in-python.h...

- Github repo: https://github.com/olsgaard/phonetic_search

[1] Gadd, T. N. “‘Fisching Fore Werds’: Phonetic Retrieval of Written Text in Information Systems.” Program 22, no. 3 (1988): 222–37.

olsgaarddk · on Aug 17, 2019

A few years ago I downloaded several hundred megabytes of Japanese subtitles, split into 3 categories: live action/drama, anime and foreign film/TV.

I’ve listed them in a Google Sheets document together with a few other corpora:

https://docs.google.com/spreadsheets/d/1yb5dq4ahdwc_g0aQTL3Y...

Choose the jimaku tab for subtitles to see how big the variation between corpora can be.

According to other comments here, it appears that OP's list is based on a newspaper corpus from 1993.

echelon · on Aug 17, 2019

The source links appear to no longer work. Do you know where we can download Japanese subtitles?

I would love to attempt to segment a bunch of Japanese subtitles into words and then do frequency analysis. My interest is in increasing my listening ability, so I want to put the most frequently spoken words into SRS/Anki, and perhaps even break it down by anime.

Alternatively, has anyone already done this?

olsgaarddk · on Aug 17, 2019

That was my initial goal, but I had a lot of trouble with vanilla MeCab not understanding a lot of the text. But this was before neologd, so I think it would work better now.

I don’t have the source code on me, but I scraped it from a website that publishes subtitles. The scraping was easy, the cleaning not, and I believe this spreadsheet is generated from my first attempt at cleaning.

A lot of sources in Japanese NLP and linguistics have a bad habit of changing URLs often, so links bitrot easily. Sorry.
