# generic list of concrete todo items that don't need further consideration

## 0.0.1 (standalone API)

- [x] working proof of concept sentence lookup using deno/sqlite3
- [ ] port dictionaries for more advanced testing
    - [x] JMdict (WIP)
    - [ ] JMNedict
- [x] add more deinflections to db/deinflections.sql
- [x] set up unit tests for sentence reading generation
- [x] port server-internal API to simple HTTP JSON API
- [ ] [improve DB schema](#how-to-store-multiple-readingswritings-in-db)
- [ ] finish [API examples](examples/readme.md)
- [ ] remove makefiles for database initialization
- [ ] add separate kanji readings/info table
- [ ] add separate frequency dictionary
- [ ] complete documentation
- [ ] add code formatter config
- [ ] ~replace .sql script files with typescript sql query generation library~ ([the problem](https://www.reddit.com/r/Deno/comments/ss6568/alternative_to_knexjs_on_deno/))

## 0.1.0 (front-end UI)

- [ ] create primitive search page ui

## always

- [ ] improve sentence parser accuracy
    - [ ] have the parser recursively explore N shorter terms at each word
      found and rank resulting possible sentences (by frequency?)
    - [ ] use domain-specific tags in reading tests (create domain-specific
      dictionaries first)
    - [ ] normalize dictionary before import
        - [ ] remove "baked" combinations of word + suffix
        - [ ] automatically create combinations of kanji replaced by kana as
          alternate writings
    - [ ] add more deinflections for casual speech and other colloquialisms
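The "recursively explore N shorter terms" idea above can be sketched roughly as follows. This is only an illustration: the dictionary, its frequency scores, and the scoring rule (sum of per-term frequencies) are all made up here, and the real implementation would query the database instead of a `Map`.

```typescript
// Toy dictionary mapping terms to a relative frequency score.
// Entries and scores are invented for illustration only.
const dict = new Map<string, number>([
  ["くり", 10],
  ["くりかえす", 80],
  ["かえす", 40],
  ["す", 5],
]);

// At each position, try the N longest matching prefixes and recurse on the
// remainder; return the segmentation with the highest summed score.
function parse(sentence: string, n = 2): { terms: string[]; score: number } {
  if (sentence.length === 0) return { terms: [], score: 0 };
  const prefixes: string[] = [];
  for (let len = sentence.length; len >= 1; len--) {
    const prefix = sentence.slice(0, len);
    if (dict.has(prefix)) prefixes.push(prefix);
    if (prefixes.length === n) break;
  }
  if (prefixes.length === 0) {
    // No dictionary match: skip one character, contributing no score.
    const rest = parse(sentence.slice(1), n);
    return { terms: [sentence[0], ...rest.terms], score: rest.score };
  }
  const candidates = prefixes.map((prefix) => {
    const rest = parse(sentence.slice(prefix.length), n);
    return {
      terms: [prefix, ...rest.terms],
      score: dict.get(prefix)! + rest.score,
    };
  });
  candidates.sort((a, b) => b.score - a.score);
  return candidates[0];
}
```

With the toy scores above, `parse("くりかえす")` prefers the single term くりかえす (score 80) over くり + かえす (score 50), which is the kind of ranking the bullet describes.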

# how to store multiple readings/writings in DB

## idea 1

positives:
- allows multiple alternate readings/writings for terms
- easy to find primary reading or writing for a term
- efficiently stores kana-only words
- allows the parser to parse alternatively written words (currently requires
  manual TypeScript intervention to resolve the `alt` field back to the actual
  term to get its tags)

negatives:
- ~creates duplicates in `text` column for readings of terms with different
  kanji but the same reading~
  
  I consider this a non-issue because this simplifies the sentence lookup
  query. The alternative (a term\<-\>reading/writing reference table) would
  save disk space in exchange for processing time and complexity.
- ~unclear how to reference a specific word without using its `term_id` (which
  can vary from user to user when different dictionaries are installed), or
  *what identifies a unique term in this case?*~
  
  `user.sort_overlay` needs to be able to uniquely identify a `term_id`, but
  also needs to be importable/exportable by users with different installed
  dictionaries (ideally with minimal failure).
  
  options to consider:
  - ~just use (primary) writing only~
    
    this doesn't work for terms whose meaning depends on the reading, e.g.
    <ruby>人気<rt>ひとけ</rt></ruby>/<ruby>人気<rt>にんき</rt></ruby>
  - ~identify as "term with text X and another text Y"~
    
    this feels like a janky solution but is what is currently being used, where
    X is always the default way of writing and Y the default reading
  - directly reference `term_id` in `user.sort_overlay` and convert to matching
    all known readings/writings at time of export/import
    
    good:
    
    - faster `user.sort_overlay` lookups
    - still allows user preference import/exporting
    
    bad:
    
    - ~all data in `user.db` becomes useless when `dict.db` is lost or corrupted~
      
      `user.sort_overlay` will be moved to `dict.db`, and `user.db` will only
      be used for storing (mostly portable) user preferences and identifiers
      (username, id, etc.).
    - importing/exporting will take longer and require more complicated sql code
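The export-time conversion from `term_id` to a portable identifier could look something like the sketch below. The row shape and function name are hypothetical; the rows themselves are taken from the example table in the next section, using the primary-writing/primary-reading flag bits.

```typescript
// Shape of a row in the proposed readwritings table (column names as in
// the example table below; they may still change).
interface Readwriting { term_id: number; text: string; flags: number }

// Rows for 繰り返す (term 1) and にやにや (term 4) from the example table.
const rows: Readwriting[] = [
  { term_id: 1, text: "繰り返す", flags: 1 },
  { term_id: 1, text: "くり返す", flags: 0 },
  { term_id: 1, text: "くりかえす", flags: 2 },
  { term_id: 4, text: "にやにや", flags: 3 },
];

// On export, replace the dictionary-local term_id with its primary
// writing/reading pair, which another user's dictionaries can match on.
function exportKey(termId: number): { writing: string; reading: string } | null {
  const own = rows.filter((r) => r.term_id === termId);
  const writing = own.find((r) => r.flags & 1)?.text;
  const reading = own.find((r) => r.flags & 2)?.text;
  return writing !== undefined && reading !== undefined
    ? { writing, reading }
    : null;
}
```

Kana-only terms like にやにや (`flags = 3`) naturally export as a pair with identical writing and reading, so no special case is needed.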
  

### example tables

#### readwritings (should have better name)

(indexes from LSB)  
`flags[0]` = primary writing  
`flags[1]` = primary reading

|`id`|`term_id`|`text`|`flags`|
|-|-|-|-|
|1|1|繰り返す|1|
|2|1|くり返す|0|
|3|1|繰返す|0|
|4|1|繰りかえす|0|
|5|1|くりかえす|2|
|6|2|変える|1|
|7|2|かえる|2|
|8|3|帰る|1|
|9|3|かえる|2|
|10|4|にやにや|3|
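With the flags indexed from the LSB as defined above, the two bits can be checked with simple masks (helper names are made up here):

```typescript
// flags[0] (LSB) marks the primary writing, flags[1] the primary reading.
const isPrimaryWriting = (flags: number): boolean => (flags & 0b01) !== 0;
const isPrimaryReading = (flags: number): boolean => (flags & 0b10) !== 0;

// Kana-only words such as にやにや (flags = 3) are both at once.
```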

# how/where to deal with irregular readings

WIP

ideally one way of storing reading exceptions for:

- 来る + する (conjugation-dependent)
- 入る as (はいる) or (いる) (not sure about this one?)
- counters (counter type + amount specific)
- numbers (exceptions for certain powers of 10)
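One possible shape for such an exception store, sketched in TypeScript (all field names are invented; only the counter readings of 人 are real — for 1 and 2 the irregular reading covers number + counter together):

```typescript
// Hypothetical reading-exception record; field names are made up here.
interface ReadingException {
  term: string;    // dictionary form, e.g. 人 or 来る
  context: string; // trigger: conjugation, counter amount, power of 10, …
  reading: string; // the irregular reading to use in that context
}

// The counter 人 genuinely has irregular readings for amounts 1 and 2.
const counterNin: ReadingException[] = [
  { term: "人", context: "count=1", reading: "ひとり" },
  { term: "人", context: "count=2", reading: "ふたり" },
];

// Return the reading for "<count>人"; amounts other than 1 and 2 fall back
// to the regular counter reading にん (the number's own reading is assumed
// to be handled elsewhere in this sketch).
function counterReading(count: number): string {
  return counterNin.find((e) => e.context === `count=${count}`)?.reading ?? "にん";
}
```

The same record shape could in principle hold the 来る/する conjugation exceptions (with `context` naming the conjugation) and the power-of-10 number exceptions, which is the "one way of storing" goal above.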

# way to link expressions to a list of conjugated terms

WIP

this may be useful for dictionaries that provide meanings for expressions but
don't provide readings for those expressions? (新和英大辞典 has some of these)

examples:
- 村長選 -> 村長 + 選[suf]
- 花より団子 -> 花 + より[grammar] + 団子
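The examples above could be represented with a simple link table from an expression to its tagged component terms; this is only a sketch of the data shape, not a committed design:

```typescript
// Hypothetical link record: an expression maps to its component terms,
// each with an optional tag (suffix, grammar, …) as in the examples.
interface Component { term: string; tag?: string }

const expressions = new Map<string, Component[]>([
  ["村長選", [{ term: "村長" }, { term: "選", tag: "suf" }]],
  ["花より団子", [{ term: "花" }, { term: "より", tag: "grammar" }, { term: "団子" }]],
]);

// With this in place, an expression's reading could be assembled from its
// components' readings when the expression dictionary provides none.
function componentTerms(expr: string): string[] {
  return (expressions.get(expr) ?? []).map((c) => c.term);
}
```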

# random thoughts

this project has 0 planning so here's a list of things that may eventually need
some thought

- how can a series-specific dictionary also 'encourage' the use of another
  domain-specific category? (e.g. an anime about programming makes
  computer-domain-specific terms rank slightly higher, or something?)
- maybe have a mode in the front-end that captures preedit text from a user
  typing Japanese to infer readings of kanji, or to rank different terms
  slightly higher? (using [compositionupdate
  events](https://developer.mozilla.org/en-US/docs/Web/API/Element/compositionupdate_event))