initial public commit

author: lonkaars <loek@pipeframe.xyz> 2023-06-28 23:59:50 +0200
committer: lonkaars <loek@pipeframe.xyz> 2023-06-28 23:59:50 +0200
commit: 67dbb6421976254658c5e38045513129dd18187a (patch)
tree: 288b599d1097b26bdbcad3b6749b38e133017cf2 /db/readme.md
1 files changed, 168 insertions, 0 deletions
diff --git a/db/readme.md b/db/readme.md
new file mode 100644
index 0000000..985ee15
--- /dev/null
+++ b/db/readme.md
@@ -0,0 +1,168 @@
+# DB
+
+Yomikun's database (sqlite3) handles search scoping and deinflection. For full
+details on how Yomikun's serach algorithm works, see [search
+algorithm](#search-algorithm).
+
+This directory contains:
+
+- Database initialization scripts
+- Deinflection rules
+- User-submitted directories
+- A makefile to initialize an empty database
+- Typescript Database interface
+
+## ERD
+
+```mermaid
+erDiagram
+
+%% tables
+deinflection
+dict
+tag
+definition
+tag_label
+term
+
+%% relations
+definition  }|--|| term : "" %% definition has one term, term can have multiple definitions
+tag_label }o--|| tag : "" %% tag_label has one tag, tag can have 0 or more labels (though it should have 1)
+term }o--o{ tag : term_tag
+definition }o--o{ tag : definition_tag
+dict ||--|{ definition : ""
+```
+
+See [dictionary init.sql](dict/init.sql) for details on tables.
+
+## Search algorithm
+
+The search algorithm runs in part in the database engine, and in Typescript.
+The Typescript part is implemented in the [Language
+subdirectory](../language).
+
+|step|implemented in|
+|-|-|
+|Scoping|Typescript and SQL|
+|Deinflection|SQL|
+|Lookup|SQL|
+|Ranking|Typescript|
+
+### Scoping
+
+Sentences are parsed in chunks. Each step of the sentence parser moves the
+chunk start forward one or more character, depending on the results of the
+previous chunk.
+
+The Typescript part of the parser shifts forward the beginning of each chunk
+(vertical sections in diagram). The database engine (sqlite3) generates all
+possible substrings from the beginning of the chunk (triangle pattern in
+diagram).
+
+This diagram shows how the sentence 「大福を食べようかな」 would be scoped with
+a maximum lookahead of 5 (real parser uses 15). Red is wanted results, other
+valid matches are highlighted in light blue. Vertical axis represents time
+(forward = down), and horizontal axis is character indices for the input
+string.
+
+<!-- I know a Gantt chart isn't the 'right' chart to use here, but it gets the
+point across -->
+```mermaid
+gantt
+dateFormat X
+axisFormat %s
+
+section 大福
+大福を食べ : 0,5s
+大福を食 : 0,4s
+大福を : 0,3s
+大福 : crit,0,2s
+大 : active,0,1s
+section を
+を食べよう : 2,5s
+を食べよ : 2,4s
+を食べ : 2,3s
+を食 : 2,2s
+を : crit,2,1s
+section 食べる
+食べようか : 3,5s
+食べよう : crit,3,4s
+食べよ : active,3,3s
+食べ : active,3,2s
+食 : active,3,1s
+section かな
+かな : crit,7,2s
+か : active,7,1s
+```
+
+### Deinflection
+
+The deinflection step uses a table with simple find/replace rules, similar to
+Yomichan's deinflection rules file. Each rule has a `kana_in`, `kana_out`,
+`rules_in` and `rules_out` column. The rules are applied using the following
+algorithm (psuedocode, real implementation is almost unreadably large SQL
+query):
+
+```python
+possibilities = []
+function deconjugate(original_input, depth) {
+  for (rule in deinflection_rules) {
+    # reset input after each rule check
+    input = original_input;
+
+    # check if rule matches at all
+    if (input does not end in rule.kana_in) continue;
+    # make sure (for example) godan deconjugations don't get applied
+    # after ichidan deconjugations
+    if (rule.rules_in does not contain input.rules) continue;
+    # swap kana_in for kana_out on input string
+    input.replace_end(rule.kana_in, rule.kana_out);
+    # check if deconjugation didn't clear the input
+    if (input.length < 1) continue;
+
+    # apply new rules to input
+    input.rules = rule.rules_out;
+    
+    # attempt another deconjugation step
+    depth += 1;
+    deconjugate(input, depth);
+  }
+}
+```
+
+The deinflection rules' `rules_in` and `rules_out` are checked using bitwise
+operators, and each bit has the same meaning as Yomichan's `rulesIn` and
+`rulesOut`:
+
+|alias|bitmask|meaning|
+|-|-|-|
+|a|`-1`|all (allow all rules)|
+||`0`|nothing|
+|ru|`1 << 0`|一段活用 (ichidan a.k.a. ru-verbs in tae kim's japanese grammar guide)|
+|u|`1 << 1`|五段活用 (godan a.k.a. u-verbs in tae kim's japanese grammar guide)|
+|s|`1 << 2`|する (suru)|
+|k|`1 << 3`|くる (kuru)|
+|z|`1 << 4`|ずる (zuru)|
+|i|`1 << 5`|形容詞 (i-adjective)|
+|iru|`1 << 6`|〜いる (temporary iru for progressive tense)|
+
+The deinflection rules are mostly derived from [Tae Kim's Japanese grammar
+guide][taekim] and are initialized in [deinflections.sql](dict/deinflections.sql).
+
+### Lookup
+
+Lookup is done on the results of the deinflection step in the database. All
+possible deinflections are checked for an exact match on either the reading or
+writing of a word in all dictionaries. This step appends any ~dictionary and~
+word tags to the tags added by the deinflector.
+
+### Ranking
+
+Ranking happens in Typescript. Ranking also removes additional illegally
+deconjugated words and gives priority to words with certain tags depending on
+context. The filtering/ranking code is intentionally kept as readable as
+possible because this code is mostly responsible for generating readings as
+accurately as possible. Read the code [here](../language/parser.ts).
+
+[taekim]: https://guidetojapanese.org/learn/
+
author	lonkaars <loek@pipeframe.xyz>	2023-06-28 23:59:50 +0200
committer	lonkaars <loek@pipeframe.xyz>	2023-06-28 23:59:50 +0200
commit	67dbb6421976254658c5e38045513129dd18187a (patch)
tree	288b599d1097b26bdbcad3b6749b38e133017cf2 /db/readme.md