Diffstat (limited to 'todo.md')
-rw-r--r-- | todo.md | 150 |
1 files changed, 150 insertions, 0 deletions
@@ -0,0 +1,150 @@
+# generic list of concrete todo items that don't need further consideration
+
+## 0.0.1 (standalone API)
+
+- [x] working proof of concept sentence lookup using deno/sqlite3
+- [ ] port dictionaries for more advanced testing
+  - [x] JMdict (WIP)
+  - [ ] JMNedict
+- [x] add more deinflections to db/deinflections.sql
+- [x] set up unit tests for sentence reading generation
+- [x] port server-internal API to simple HTTP JSON API
+- [ ] [improve DB schema](#how-to-store-multiple-readingswritings-in-db)
+- [ ] finish [API examples](examples/readme.md)
+- [ ] remove makefiles for database initialization
+- [ ] add separate kanji readings/info table
+- [ ] add separate frequency dictionary
+- [ ] complete documentation
+- [ ] add code formatter config
+- [ ] ~replace .sql script files with typescript sql query generation library~ ([the problem](https://www.reddit.com/r/Deno/comments/ss6568/alternative_to_knexjs_on_deno/))
+
+## 0.1.0 (front-end UI)
+
+- [ ] create primitive search page ui
+
+## always
+
+- [ ] improve sentence parser accuracy
+  - [ ] have the parser recursively explore N shorter terms at each word
+    found and rank resulting possible sentences (by frequency?)
+  - [ ] use domain-specific tags in reading tests (create domain-specific
+    dictionaries first)
+  - [ ] normalize dictionary before import
+    - [ ] remove "baked" combinations of word + suffix
+    - [ ] automatically create combinations of kanji replaced by kana as
+      alternate writings
+  - [ ] add more deinflections for casual speech and other colloquialisms
+
+# how to store multiple readings/writings in DB
+
+## idea 1
+
+positives:
+- allows multiple alternate readings/writings for terms
+- easy to find primary reading or writing for a term
+- efficiently stores kana-only words
+- allows parser to parse alternatively written words (currently requires manual
+  typescript intervention to resolve `alt` field back to actual term to get
+  it's tags)
+
+negatives:
+- ~creates duplicates in `text` column for readings of terms with different
+  kanji but the same reading~
+
+  I consider this a non-issue because this simplifies the sentence lookup
+  query. The alternative (a term\<-\>reading/writing reference table) would
+  save disk space in exchange for processing time and complexity.
+- ~unclear how to reference a specific word without using it's `term_id` (which
+  can vary from user to user when different dictionaries are installed), or
+  *what identifies a unique term in this case?*~
+
+  `user.sort_overlay` needs to be able to uniquely identify a `term_id`, but
+  also needs to be in-/exportable by users with different imported dictionaries
+  (ideally with minimal failure).
+
+  things to consider:
+
+  options:
+  - ~just use (primary) writing only~
+
+    this doesn't work for terms with multiple readings to distinguish between
+    meanings, e.g.
+    <ruby>人気<rt>ひとけ</rt></ruby>/<ruby>人気<rt>にんき</rt></ruby>
+  - ~identify as "term with text X and another text Y"~
+
+    this feels like a janky solution but is what is currently being used, where
+    X is always the default way of writing and Y the default reading
+  - directly reference `term_id` in `user.sort_overlay` and convert to matching
+    all known readings/writings at time of export/import
+
+    good:
+
+    - faster `user.sort_overlay` lookups
+    - still allows user preference import/exporting
+
+    bad:
+
+    - ~all data in `user.db` becomes useless when `dict.db` is lost or corrupted~
+
+      `user.sort_overlay` will be moved to `dict.db`, and `user.db` will only
+      be used for storing (mostly portable) user preferences and identifiers
+      (username, id, etc.).
+    - importing/exporting will take longer and require more complicated sql code
+
+
+### example tables
+
+#### readwritings (should have better name)
+
+(indexes from LSB)
+`flags[0]` = primary writing
+`flags[1]` = primary reading
+
+|`id`|`term_id`|`text`|`flags`|
+|-|-|-|-|
+|1|1|繰り返す|1|
+|2|1|くり返す|0|
+|3|1|繰返す|0|
+|4|1|繰りかえす|0|
+|5|1|くりかえす|2|
+|6|2|変える|1|
+|7|2|かえる|2|
+|8|3|帰る|1|
+|9|3|かえる|2|
+|10|4|にやにや|3|
+
+# how/where to deal with irregular readings
+
+WIP
+
+ideally one way of storing reading exceptions for:
+
+- 来る + する (conjugation-dependent)
+- 入る as (はいる) or (いる) (not sure about this one?)
+- counters (counter type + amount specific)
+- numbers (exceptions for certain powers of 10)
+
+# way to link expressions to a list of conjugated terms
+
+WIP
+
+this may be useful for dictionaries that provide meanings for expressions but
+don't provide readings for those expressions? (新和英大辞典 has some of these)
+
+examples:
+- 村長選 -> 村長 + 選[suf]
+- 花より団子 -> 花 + より[grammar] + 団子
+
+# random thoughts
+
+this project has 0 planning so here's a list of things that may eventually need
+some thought
+
+- how can a series-specific dictionary also 'encourage' the use of another
+  domain-specific category? (e.g. anime about programming makes computer domain
+  specific terms rank slightly higher or something?)
+- maybe have a mode in the front-end that captures preedit text from a user
+  typing japanese text to infer readings of kanji, or rank different terms
+  slightly higher? (using [compositionupdate
+  events](https://developer.mozilla.org/en-US/docs/Web/API/Element/compositionupdate_event))
+
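Note on the `readwritings` example table in the diff above: the `flags` column packs two booleans, indexed from the LSB (bit 0 = primary writing, bit 1 = primary reading). The following is only a minimal TypeScript sketch of how such a column could be decoded. The names `Flag`, `ReadWriting`, and `primaryText` are hypothetical and do not come from the commit; the rows are copied from the example table and held in memory instead of being read from the database.

```typescript
// Hypothetical bit masks matching the table notes:
// flags[0] = primary writing, flags[1] = primary reading (indexed from LSB).
const Flag = {
  PrimaryWriting: 1 << 0,
  PrimaryReading: 1 << 1,
} as const;

// Hypothetical row shape mirroring the example table's columns.
interface ReadWriting {
  id: number;
  term_id: number;
  text: string;
  flags: number;
}

// Rows copied from the example table in todo.md.
const rows: ReadWriting[] = [
  { id: 1, term_id: 1, text: "繰り返す", flags: 1 },
  { id: 2, term_id: 1, text: "くり返す", flags: 0 },
  { id: 3, term_id: 1, text: "繰返す", flags: 0 },
  { id: 4, term_id: 1, text: "繰りかえす", flags: 0 },
  { id: 5, term_id: 1, text: "くりかえす", flags: 2 },
  { id: 6, term_id: 2, text: "変える", flags: 1 },
  { id: 7, term_id: 2, text: "かえる", flags: 2 },
  { id: 8, term_id: 3, text: "帰る", flags: 1 },
  { id: 9, term_id: 3, text: "かえる", flags: 2 },
  { id: 10, term_id: 4, text: "にやにや", flags: 3 },
];

// Return the text of the first row for `termId` that has `flag` set,
// or undefined if the term has no such row.
function primaryText(
  all: ReadWriting[],
  termId: number,
  flag: number,
): string | undefined {
  return all.find((r) => r.term_id === termId && (r.flags & flag) !== 0)?.text;
}

const writing = primaryText(rows, 1, Flag.PrimaryWriting);
const reading = primaryText(rows, 1, Flag.PrimaryReading);
console.log(writing, reading);
```

A kana-only word such as term 4 has a single row with both bits set (`flags` = 3), so the same lookup returns that row for either flag. This matches the "efficiently stores kana-only words" point in the list of positives.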