According to Kenneth G. Henshall, who wrote a book on the etymology of kanji [1], it is quite common to choose a phonetic part that lends itself to the meaning of the word and not just the sound.
For anyone who wants to geek out on this further, I highly recommend the introduction in "Decoding Kanji: A Practical Approach to Learning Look-alike Characters" by Yaeko Sato Habein [1]. Despite being a workbook for intermediate students, the introduction really goes all-in on this stuff. It treats the topic very academically, but it is also a treasure trove of kanji-related "fun facts", and the appendix has interesting compilations such as "Pairs of homonymous kanji compounds with one kanji in common".
Kinda off-topic, since the article is about not making errors rather than logging them, but sometimes you are not in control of the input, and you expect some exceptions that you want to log.
In Python, instead of the example in the blog:
    try:
        do_something_complex(*args, **kwargs)
    except MyException as ex:
        logger.log(ex)
Use the little-known `exc_info=True`, like this:
    try:
        do_something_complex(*args, **kwargs)
    except MyException as ex:
        logger.error(ex, exc_info=True)
This will log the entire exception, traceback included.
Even cooler, for every uncaught exception the interpreter calls `sys.excepthook()`. The function is meant to be replaced (monkey patched) by anyone who wants to do something with uncaught exceptions, so you can assign your own function to it if you want to log all uncaught exceptions.
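A minimal sketch of such a hook, assuming a module-level `logger`. The function name `log_uncaught_exception` and the choice to pass Ctrl-C through to the default hook are mine, not something from the article:

```python
import logging
import sys

logger = logging.getLogger(__name__)


def log_uncaught_exception(exc_type, exc_value, exc_traceback):
    """Replacement for sys.excepthook that logs instead of printing."""
    # Assumption: keep the default behaviour for Ctrl-C so the program
    # still exits quietly when interrupted.
    if issubclass(exc_type, KeyboardInterrupt):
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    # exc_info also accepts a (type, value, traceback) tuple, so the
    # full traceback ends up in the log record.
    logger.critical(
        "Uncaught exception",
        exc_info=(exc_type, exc_value, exc_traceback),
    )


# From here on, uncaught exceptions in the main thread go through the logger.
sys.excepthook = log_uncaught_exception
```

Note that this only covers the main thread; exceptions raised in other threads are handled separately by `threading.excepthook`.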
Mostly because I didn't know of it. But `exc_info=True` is also handy if you want to log caught exceptions at debug level because you were expecting them. E.g., a StopIteration exception you for some reason want information about, or an object where you need to check whether it has a certain key, but which for some reason doesn't have a get method.
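Roughly what I mean by the second case, as a sketch. `Settings` and `lookup` are made-up names for illustration; the only real API here is `logger.debug(..., exc_info=True)`:

```python
import logging

logger = logging.getLogger(__name__)


class Settings:
    """Hypothetical object that supports obj[key] but has no .get()."""

    def __init__(self, **values):
        self._values = values

    def __getitem__(self, key):
        return self._values[key]


def lookup(obj, key, default=None):
    """Return obj[key], falling back to default when the key is missing."""
    try:
        return obj[key]
    except KeyError:
        # The exception is expected, so it is logged at DEBUG level,
        # but the traceback is still attached via exc_info=True.
        logger.debug("Key %r missing, using default", key, exc_info=True)
        return default
```

With the default logging configuration nothing is printed for the missing key; the traceback only shows up once the logger is set to DEBUG.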
During my graduate degree, I was looking into phonetic search, and a lot of the papers I stumbled across were using an expanded version of Soundex called Phonix[1] as the basis for new algorithms. However, I had a lot of trouble finding an implementation I could use, so I implemented it in Python.
After that I built a search engine programme that would look up swear words that were phonetically similar to input terms, as I thought that my fellow students might get a laugh looking up their own names. In the end, neither Phonix, Double Metaphone nor Soundex really produced any funny results.
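For a rough idea of how this kind of matching works, here is a sketch of plain American Soundex (not Phonix, and not the code I wrote back then). Two words count as phonetically similar when they map to the same four-character code:

```python
# Letter groups that share a Soundex digit.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character American Soundex code for an ASCII name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not break a run of equal codes.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # Vowels have no code, so they reset prev and allow a repeat.
        prev = code
    # Pad with zeros or truncate to one letter plus three digits.
    return (result + "000")[:4]
```

For example, `soundex("Robert")` and `soundex("Rupert")` both give `R163`, so a lookup table keyed on the code would treat them as matches.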
A few years ago I downloaded several hundred megabytes of Japanese subtitles, split into 3 categories: live action/drama, anime and foreign film/TV.
I’ve listed them in a Google Sheets spreadsheet together with a few other corpora.
The source links appear to no longer work. Do you know where we can download Japanese subtitles?
I would love to attempt to segment a bunch of Japanese subtitles into words and then do frequency analysis. My interest is in increasing my listening ability, so I want to put the most frequently spoken words into SRS/Anki, and perhaps even break it down by anime.
That was my initial goal, but I had a lot of trouble with vanilla MeCab not understanding a lot of the text. But this was before NEologd, so I think it would work better now.
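The counting part is the easy bit. A sketch, assuming you already have some `segment` callable that turns a subtitle line into a list of words (`word_frequencies` and `segment` are illustrative names, not from my original code):

```python
from collections import Counter


def word_frequencies(lines, segment):
    """Count word occurrences across subtitle lines.

    `segment` is any callable mapping a line of text to a list of
    tokens; it stands in for a real morphological analyser.
    """
    counts = Counter()
    for line in lines:
        # Skip empty or whitespace-only tokens the segmenter may emit.
        counts.update(token for token in segment(line) if token.strip())
    return counts


# Stand-in segmenter: the sample lines are already space-separated.
sample = ["猫 が 好き", "犬 が 好き"]
frequencies = word_frequencies(sample, str.split)
```

With the mecab-python3 package installed, something like `lambda line: MeCab.Tagger("-Owakati").parse(line).split()` could be passed as `segment` instead, and `frequencies.most_common(n)` then gives the candidates for an Anki deck.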
I don’t have the source code on me, but I scraped it from a website that publishes subtitles. The scraping was easy, the cleaning not, and I believe this spreadsheet is generated from my first attempt at cleaning.
A lot of sources in Japanese NLP and linguistics have a bad habit of changing URLs often, so links bitrot easily. Sorry.
[1] https://www.amazon.com/Guide-Remembering-Japanese-Characters...