# DB

Yomikun's database (sqlite3) handles search scoping and deinflection. For full
details on how Yomikun's search algorithm works, see [search
algorithm](#search-algorithm).

This directory contains:

- Database initialization scripts
- Deinflection rules
- User-submitted dictionaries
- A makefile to initialize an empty database
- A TypeScript database interface

## ERD

```mermaid
erDiagram

%% tables
deinflection
dict
tag
definition
tag_label
term

%% relations
definition  }|--|| term : "" %% definition has one term, term can have multiple definitions
tag_label }o--|| tag : "" %% tag_label has one tag, tag can have 0 or more labels (though it should have 1)
term }o--o{ tag : term_tag
definition }o--o{ tag : definition_tag
dict ||--|{ definition : ""
```

See [dictionary init.sql](dict/init.sql) for details on tables.

## Search algorithm

The search algorithm runs partly in the database engine and partly in
TypeScript. The TypeScript part is implemented in the [language
subdirectory](../language).

|step|implemented in|
|-|-|
|Scoping|TypeScript and SQL|
|Deinflection|SQL|
|Lookup|SQL|
|Ranking|TypeScript|

### Scoping

Sentences are parsed in chunks. Each step of the sentence parser moves the
chunk start forward by one or more characters, depending on the results of the
previous chunk.

The Typescript part of the parser shifts forward the beginning of each chunk
(vertical sections in diagram). The database engine (sqlite3) generates all
possible substrings from the beginning of the chunk (triangle pattern in
diagram).

This diagram shows how the sentence 「大福を食べようかな」 would be scoped with
a maximum lookahead of 5 (the real parser uses 15). Wanted results are
highlighted in red, other valid matches in light blue. The vertical axis
represents time (down = forward), and the horizontal axis represents character
indices in the input string.

<!-- I know a Gantt chart isn't the 'right' chart to use here, but it gets the
point across -->
```mermaid
gantt
dateFormat X
axisFormat %s

section 大福
大福を食べ : 0,5s
大福を食 : 0,4s
大福を : 0,3s
大福 : crit,0,2s
大 : active,0,1s
section を
を食べよう : 2,5s
を食べよ : 2,4s
を食べ : 2,3s
を食 : 2,2s
を : crit,2,1s
section 食べる
食べようか : 3,5s
食べよう : crit,3,4s
食べよ : active,3,3s
食べ : active,3,2s
食 : active,3,1s
section かな
かな : crit,7,2s
か : active,7,1s
```
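The substring generation above can be sketched as follows (a minimal
TypeScript sketch; `generateScopes` is a hypothetical name — the actual
implementation runs inside sqlite3):

```typescript
// Generate every substring that starts at `start`, up to `lookahead`
// characters long: the "triangle pattern" from the diagram above.
// Array.from splits the string into code points, so each kanji counts
// as one character.
function generateScopes(sentence: string, start: number, lookahead: number): string[] {
  const chars = Array.from(sentence);
  const scopes: string[] = [];
  for (let len = Math.min(lookahead, chars.length - start); len >= 1; len--) {
    scopes.push(chars.slice(start, start + len).join(""));
  }
  return scopes;
}

console.log(generateScopes("大福を食べようかな", 0, 5));
// → [ '大福を食べ', '大福を食', '大福を', '大福', '大' ]
```

Longest-first order matches the diagram: each chunk is tried from the full
lookahead window down to a single character.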

### Deinflection

The deinflection step uses a table with simple find/replace rules, similar to
Yomichan's deinflection rules file. Each rule has a `kana_in`, `kana_out`,
`rules_in` and `rules_out` column. The rules are applied using the following
algorithm (pseudocode; the real implementation is an almost unreadably large
SQL query):

```python
possibilities = []

def deconjugate(original_input, depth):
    for rule in deinflection_rules:
        # reset input after each rule check
        input = original_input.copy()

        # check if the rule matches at all
        if not input.text.endswith(rule.kana_in):
            continue
        # make sure (for example) godan deconjugations don't get applied
        # after ichidan deconjugations (bitwise check against rules_in)
        if input.rules != 0 and (rule.rules_in & input.rules) == 0:
            continue

        # swap kana_in for kana_out on the input string
        input.text = input.text.removesuffix(rule.kana_in) + rule.kana_out
        # check that the deconjugation didn't clear the input
        if len(input.text) < 1:
            continue

        # apply the rule's output rules to the input
        input.rules = rule.rules_out
        possibilities.append(input)

        # attempt another deconjugation step
        deconjugate(input, depth + 1)
```
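As a concrete illustration of a single find/replace step, a hypothetical rule
mapping た back to る (illustrative rule data, not the actual table contents)
deinflects 食べた into the dictionary form 食べる:

```typescript
// Hypothetical rule shape mirroring the deinflection table's columns.
interface Rule { kanaIn: string; kanaOut: string; rulesIn: number; rulesOut: number }

const RULE_ALL = -1;    // 'a' in the bitmask table below: allow all rules
const RULE_RU = 1 << 0; // 一段活用 (ichidan)

// Illustrative rule: plain past た on an ichidan stem back to る
const pastTa: Rule = { kanaIn: "た", kanaOut: "る", rulesIn: RULE_ALL, rulesOut: RULE_RU };

// One find/replace step: swap kanaIn for kanaOut at the end of the word,
// or return null when the rule doesn't apply or clears the input.
function applyRule(word: string, rule: Rule): string | null {
  if (!word.endsWith(rule.kanaIn)) return null;
  const out = word.slice(0, word.length - rule.kanaIn.length) + rule.kanaOut;
  return out.length >= 1 ? out : null;
}

console.log(applyRule("食べた", pastTa)); // → 食べる
```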

The deinflection rules' `rules_in` and `rules_out` are checked using bitwise
operators, and each bit has the same meaning as Yomichan's `rulesIn` and
`rulesOut`:

|alias|bitmask|meaning|
|-|-|-|
|a|`-1`|all (allow all rules)|
||`0`|nothing|
|ru|`1 << 0`|一段活用 (ichidan, a.k.a. ru-verbs in Tae Kim's Japanese grammar guide)|
|u|`1 << 1`|五段活用 (godan, a.k.a. u-verbs in Tae Kim's Japanese grammar guide)|
|s|`1 << 2`|する (suru)|
|k|`1 << 3`|くる (kuru)|
|z|`1 << 4`|ずる (zuru)|
|i|`1 << 5`|形容詞 (i-adjective)|
|iru|`1 << 6`|〜いる (temporary iru for progressive tense)|
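One plausible reading of the bitwise check (the exact containment semantics
live in the SQL query; names here are illustrative) is:

```typescript
// Bit flags matching the table above.
const RULE_ALL = -1;    // a: every bit set, allows all rules
const RULE_RU = 1 << 0; // 一段活用 (ichidan)
const RULE_U = 1 << 1;  // 五段活用 (godan)

// A rule applies when the candidate has no rules yet (first deinflection
// step), or when its rule bits overlap the rule's rules_in mask. With
// rules_in = -1 every bit is set, so the rule matches any candidate.
function ruleApplies(rulesIn: number, candidateRules: number): boolean {
  return candidateRules === 0 || (rulesIn & candidateRules) !== 0;
}

console.log(ruleApplies(RULE_RU, RULE_U));   // → false: ichidan-only rule, godan candidate
console.log(ruleApplies(RULE_ALL, RULE_U));  // → true: 'a' allows everything
```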

The deinflection rules are mostly derived from [Tae Kim's Japanese grammar
guide][taekim] and are initialized in [deinflections.sql](dict/deinflections.sql).

### Lookup

Lookup is done in the database on the results of the deinflection step. All
possible deinflections are checked for an exact match on either the reading
or the writing of a word in all dictionaries. This step appends any
~~dictionary and~~ word tags to the tags added by the deinflector.
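An in-memory sketch of this exact-match step (the real lookup runs as SQL;
`Term` and `lookup` are illustrative names, not the actual interface):

```typescript
interface Term { writing: string; reading: string; tags: string[] }

// Keep only terms whose writing or reading exactly matches one of the
// deinflection candidates.
function lookup(candidates: string[], terms: Term[]): Term[] {
  return terms.filter((t) =>
    candidates.some((c) => c === t.writing || c === t.reading)
  );
}

const terms: Term[] = [
  { writing: "食べる", reading: "たべる", tags: ["v1"] },
  { writing: "食う", reading: "くう", tags: ["v5u"] },
];
console.log(lookup(["食べる", "食べ"], terms)); // matches only 食べる
```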

### Ranking

Ranking happens in TypeScript. Ranking also removes illegally deconjugated
words and gives priority to words with certain tags depending on context. The
filtering/ranking code is intentionally kept as readable as possible, because
this code is mostly responsible for generating readings as accurately as
possible. Read the code [here](../language/parser.ts).
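A comparator of roughly this shape could implement the tag-priority part (tag
names and weights here are purely illustrative; the actual logic lives in
parser.ts):

```typescript
interface Result { word: string; tags: string[] }

// Purely illustrative tag weights: higher total scores sort first.
const TAG_PRIORITY: Record<string, number> = { "priority": 2, "common": 1 };

// Sort a copy of the results, highest combined tag score first.
function rank(results: Result[]): Result[] {
  const score = (r: Result) =>
    r.tags.reduce((sum, tag) => sum + (TAG_PRIORITY[tag] ?? 0), 0);
  return [...results].sort((a, b) => score(b) - score(a));
}

console.log(rank([
  { word: "はし", tags: [] },
  { word: "箸", tags: ["priority"] },
])[0].word); // → 箸
```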

[taekim]: https://guidetojapanese.org/learn/