diff options
author | lonkaars <loek@pipeframe.xyz> | 2023-06-28 23:59:50 +0200 |
---|---|---|
committer | lonkaars <loek@pipeframe.xyz> | 2023-06-28 23:59:50 +0200 |
commit | 67dbb6421976254658c5e38045513129dd18187a (patch) | |
tree | 288b599d1097b26bdbcad3b6749b38e133017cf2 /db/readme.md |
initial public commit
Diffstat (limited to 'db/readme.md')
-rw-r--r-- | db/readme.md | 168 |
1 files changed, 168 insertions, 0 deletions
diff --git a/db/readme.md b/db/readme.md new file mode 100644 index 0000000..985ee15 --- /dev/null +++ b/db/readme.md @@ -0,0 +1,168 @@ +# DB + +Yomikun's database (sqlite3) handles search scoping and deinflection. For full +details on how Yomikun's serach algorithm works, see [search +algorithm](#search-algorithm). + +This directory contains: + +- Database initialization scripts +- Deinflection rules +- User-submitted directories +- A makefile to initialize an empty database +- Typescript Database interface + +## ERD + +```mermaid +erDiagram + +%% tables +deinflection +dict +tag +definition +tag_label +term + +%% relations +definition }|--|| term : "" %% definition has one term, term can have multiple definitions +tag_label }o--|| tag : "" %% tag_label has one tag, tag can have 0 or more labels (though it should have 1) +term }o--o{ tag : term_tag +definition }o--o{ tag : definition_tag +dict ||--|{ definition : "" +``` + +See [dictionary init.sql](dict/init.sql) for details on tables. + +## Search algorithm + +The search algorithm runs in part in the database engine, and in Typescript. +The Typescript part is implemented in the [Language +subdirectory](../language). + +|step|implemented in| +|-|-| +|Scoping|Typescript and SQL| +|Deinflection|SQL| +|Lookup|SQL| +|Ranking|Typescript| + +### Scoping + +Sentences are parsed in chunks. Each step of the sentence parser moves the +chunk start forward one or more character, depending on the results of the +previous chunk. + +The Typescript part of the parser shifts forward the beginning of each chunk +(vertical sections in diagram). The database engine (sqlite3) generates all +possible substrings from the beginning of the chunk (triangle pattern in +diagram). + +This diagram shows how the sentence 「大福を食べようかな」 would be scoped with +a maximum lookahead of 5 (real parser uses 15). Red is wanted results, other +valid matches are highlighted in light blue. Vertical axis represents time +(forward = down), and horizontal axis is character indices for the input +string. + +<!-- I know a Gantt chart isn't the 'right' chart to use here, but it gets the +point across --> +```mermaid +gantt +dateFormat X +axisFormat %s + +section 大福 +大福を食べ : 0,5s +大福を食 : 0,4s +大福を : 0,3s +大福 : crit,0,2s +大 : active,0,1s +section を +を食べよう : 2,5s +を食べよ : 2,4s +を食べ : 2,3s +を食 : 2,2s +を : crit,2,1s +section 食べる +食べようか : 3,5s +食べよう : crit,3,4s +食べよ : active,3,3s +食べ : active,3,2s +食 : active,3,1s +section かな +かな : crit,7,2s +か : active,7,1s +``` + +### Deinflection + +The deinflection step uses a table with simple find/replace rules, similar to +Yomichan's deinflection rules file. Each rule has a `kana_in`, `kana_out`, +`rules_in` and `rules_out` column. The rules are applied using the following +algorithm (psuedocode, real implementation is almost unreadably large SQL +query): + +```python +possibilities = [] +function deconjugate(original_input, depth) { + for (rule in deinflection_rules) { + # reset input after each rule check + input = original_input; + + # check if rule matches at all + if (input does not end in rule.kana_in) continue; + # make sure (for example) godan deconjugations don't get applied + # after ichidan deconjugations + if (rule.rules_in does not contain input.rules) continue; + # swap kana_in for kana_out on input string + input.replace_end(rule.kana_in, rule.kana_out); + # check if deconjugation didn't clear the input + if (input.length < 1) continue; + + # apply new rules to input + input.rules = rule.rules_out; + + # attempt another deconjugation step + depth += 1; + deconjugate(input, depth); + } +} +``` + +The deinflection rules' `rules_in` and `rules_out` are checked using bitwise +operators, and each bit has the same meaning as Yomichan's `rulesIn` and +`rulesOut`: + +|alias|bitmask|meaning| +|-|-|-| +|a|`-1`|all (allow all rules)| +||`0`|nothing| +|ru|`1 << 0`|一段活用 (ichidan a.k.a. ru-verbs in tae kim's japanese grammar guide)| +|u|`1 << 1`|五段活用 (godan a.k.a. u-verbs in tae kim's japanese grammar guide)| +|s|`1 << 2`|する (suru)| +|k|`1 << 3`|くる (kuru)| +|z|`1 << 4`|ずる (zuru)| +|i|`1 << 5`|形容詞 (i-adjective)| +|iru|`1 << 6`|〜いる (temporary iru for progressive tense)| + +The deinflection rules are mostly derived from [Tae Kim's Japanese grammar +guide][taekim] and are initialized in [deinflections.sql](dict/deinflections.sql). + +### Lookup + +Lookup is done on the results of the deinflection step in the database. All +possible deinflections are checked for an exact match on either the reading or +writing of a word in all dictionaries. This step appends any ~dictionary and~ +word tags to the tags added by the deinflector. + +### Ranking + +Ranking happens in Typescript. Ranking also removes additional illegally +deconjugated words and gives priority to words with certain tags depending on +context. The filtering/ranking code is intentionally kept as readable as +possible because this code is mostly responsible for generating readings as +accurately as possible. Read the code [here](../language/parser.ts). + +[taekim]: https://guidetojapanese.org/learn/ + |