# DB Yomikun's database (sqlite3) handles search scoping and deinflection. For full details on how Yomikun's serach algorithm works, see [search algorithm](#search-algorithm). This directory contains: - Database initialization scripts - Deinflection rules - User-submitted directories - A makefile to initialize an empty database - Typescript Database interface ## ERD ```mermaid erDiagram %% tables deinflection dict tag definition tag_label term %% relations definition }|--|| term : "" %% definition has one term, term can have multiple definitions tag_label }o--|| tag : "" %% tag_label has one tag, tag can have 0 or more labels (though it should have 1) term }o--o{ tag : term_tag definition }o--o{ tag : definition_tag dict ||--|{ definition : "" ``` See [dictionary init.sql](dict/init.sql) for details on tables. ## Search algorithm The search algorithm runs in part in the database engine, and in Typescript. The Typescript part is implemented in the [Language subdirectory](../language). |step|implemented in| |-|-| |Scoping|Typescript and SQL| |Deinflection|SQL| |Lookup|SQL| |Ranking|Typescript| ### Scoping Sentences are parsed in chunks. Each step of the sentence parser moves the chunk start forward one or more character, depending on the results of the previous chunk. The Typescript part of the parser shifts forward the beginning of each chunk (vertical sections in diagram). The database engine (sqlite3) generates all possible substrings from the beginning of the chunk (triangle pattern in diagram). This diagram shows how the sentence 「大福を食べようかな」 would be scoped with a maximum lookahead of 5 (real parser uses 15). Red is wanted results, other valid matches are highlighted in light blue. Vertical axis represents time (forward = down), and horizontal axis is character indices for the input string. ```mermaid gantt dateFormat X axisFormat %s section 大福 大福を食べ : 0,5s 大福を食 : 0,4s 大福を : 0,3s 大福 : crit,0,2s 大 : active,0,1s section を を食べよう : 2,5s を食べよ : 2,4s を食べ : 2,3s を食 : 2,2s を : crit,2,1s section 食べる 食べようか : 3,5s 食べよう : crit,3,4s 食べよ : active,3,3s 食べ : active,3,2s 食 : active,3,1s section かな かな : crit,7,2s か : active,7,1s ``` ### Deinflection The deinflection step uses a table with simple find/replace rules, similar to Yomichan's deinflection rules file. Each rule has a `kana_in`, `kana_out`, `rules_in` and `rules_out` column. The rules are applied using the following algorithm (psuedocode, real implementation is almost unreadably large SQL query): ```python possibilities = [] function deconjugate(original_input, depth) { for (rule in deinflection_rules) { # reset input after each rule check input = original_input; # check if rule matches at all if (input does not end in rule.kana_in) continue; # make sure (for example) godan deconjugations don't get applied # after ichidan deconjugations if (rule.rules_in does not contain input.rules) continue; # swap kana_in for kana_out on input string input.replace_end(rule.kana_in, rule.kana_out); # check if deconjugation didn't clear the input if (input.length < 1) continue; # apply new rules to input input.rules = rule.rules_out; # attempt another deconjugation step depth += 1; deconjugate(input, depth); } } ``` The deinflection rules' `rules_in` and `rules_out` are checked using bitwise operators, and each bit has the same meaning as Yomichan's `rulesIn` and `rulesOut`: |alias|bitmask|meaning| |-|-|-| |a|`-1`|all (allow all rules)| ||`0`|nothing| |ru|`1 << 0`|一段活用 (ichidan a.k.a. ru-verbs in tae kim's japanese grammar guide)| |u|`1 << 1`|五段活用 (godan a.k.a. u-verbs in tae kim's japanese grammar guide)| |s|`1 << 2`|する (suru)| |k|`1 << 3`|くる (kuru)| |z|`1 << 4`|ずる (zuru)| |i|`1 << 5`|形容詞 (i-adjective)| |iru|`1 << 6`|〜いる (temporary iru for progressive tense)| The deinflection rules are mostly derived from [Tae Kim's Japanese grammar guide][taekim] and are initialized in [deinflections.sql](dict/deinflections.sql). ### Lookup Lookup is done on the results of the deinflection step in the database. All possible deinflections are checked for an exact match on either the reading or writing of a word in all dictionaries. This step appends any ~dictionary and~ word tags to the tags added by the deinflector. ### Ranking Ranking happens in Typescript. Ranking also removes additional illegally deconjugated words and gives priority to words with certain tags depending on context. The filtering/ranking code is intentionally kept as readable as possible because this code is mostly responsible for generating readings as accurately as possible. Read the code [here](../language/parser.ts). [taekim]: https://guidetojapanese.org/learn/