diff --git a/todo.md b/todo.md
new file mode 100644
index 0000000..6840877
--- /dev/null
+++ b/todo.md
@@ -0,0 +1,150 @@
+# generic list of concrete todo items that don't need further consideration
+
+## 0.0.1 (standalone API)
+
+- [x] working proof of concept sentence lookup using deno/sqlite3
+- [ ] port dictionaries for more advanced testing
+ - [x] JMdict (WIP)
+ - [ ] JMNedict
+- [x] add more deinflections to db/deinflections.sql
+- [x] set up unit tests for sentence reading generation
+- [x] port server-internal API to simple HTTP JSON API
+- [ ] [improve DB schema](#how-to-store-multiple-readingswritings-in-db)
+- [ ] finish [API examples](examples/readme.md)
+- [ ] remove makefiles for database initialization
+- [ ] add separate kanji readings/info table
+- [ ] add separate frequency dictionary
+- [ ] complete documentation
+- [ ] add code formatter config
+- [ ] ~replace .sql script files with typescript sql query generation library~ ([the problem](https://www.reddit.com/r/Deno/comments/ss6568/alternative_to_knexjs_on_deno/))
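
The deinflection rules themselves live in db/deinflections.sql and aren't shown here, but the general shape of suffix-substitution deinflection can be sketched as follows (the rule data and names below are illustrative assumptions, not the project's actual rules):

```typescript
// Minimal deinflection sketch: each rule strips an inflected suffix and
// substitutes the dictionary-form suffix. Rule data is illustrative only.
interface DeinflectRule {
  inflected: string;  // suffix as it appears in the text
  dictionary: string; // suffix of the dictionary (plain) form
  reason: string;     // label for the deinflection step
}

const rules: DeinflectRule[] = [
  { inflected: "た", dictionary: "る", reason: "past" },      // 食べた -> 食べる
  { inflected: "ます", dictionary: "る", reason: "polite" },  // 食べます -> 食べる
  { inflected: "ない", dictionary: "る", reason: "negative" },// 食べない -> 食べる
];

// Return every candidate dictionary form for a surface string.
function deinflect(word: string): { term: string; reason: string }[] {
  const results = [{ term: word, reason: "as-is" }];
  for (const rule of rules) {
    if (word.endsWith(rule.inflected)) {
      results.push({
        term: word.slice(0, word.length - rule.inflected.length) + rule.dictionary,
        reason: rule.reason,
      });
    }
  }
  return results;
}
```

Each candidate would then be checked against the dictionary; only candidates that actually exist as terms survive.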
+
+## 0.1.0 (front-end UI)
+
+- [ ] create primitive search page ui
+
+## always
+
+- [ ] improve sentence parser accuracy
+ - [ ] have the parser recursively explore N shorter terms at each word
+ found and rank resulting possible sentences (by frequency?)
+ - [ ] use domain-specific tags in reading tests (create domain-specific
+ dictionaries first)
+ - [ ] normalize dictionary before import
+ - [ ] remove "baked" combinations of word + suffix
+ - [ ] automatically create combinations of kanji replaced by kana as
+ alternate writings
+ - [ ] add more deinflections for casual speech and other colloquialisms
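
The recursive exploration idea above could look something like this sketch (the dictionary, the frequency scores, and the scoring rule are stand-ins, not the current implementation):

```typescript
// Sketch: at each position, explore the N longest dictionary prefixes and
// rank complete segmentations by summed frequency. Dictionary is a stand-in.
const dictionary = new Map<string, number>([ // term -> frequency score
  ["きょう", 100], ["き", 10], ["ょう", 0], ["は", 200],
]);

const N = 2; // how many alternatives to explore per position

interface Parse { terms: string[]; score: number; }

function parse(sentence: string): Parse[] {
  if (sentence.length === 0) return [{ terms: [], score: 0 }];
  // collect all dictionary prefixes, longest first
  const prefixes: string[] = [];
  for (let len = sentence.length; len >= 1; len--) {
    const prefix = sentence.slice(0, len);
    if (dictionary.has(prefix)) prefixes.push(prefix);
  }
  const results: Parse[] = [];
  for (const prefix of prefixes.slice(0, N)) {
    for (const rest of parse(sentence.slice(prefix.length))) {
      results.push({
        terms: [prefix, ...rest.terms],
        score: (dictionary.get(prefix) ?? 0) + rest.score,
      });
    }
  }
  return results.sort((a, b) => b.score - a.score);
}
```

For the stand-in data, `parse("きょうは")` ranks きょう+は above き+ょう+は.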
+
+# how to store multiple readings/writings in DB
+
+## idea 1
+
+positives:
+- allows multiple alternate readings/writings for terms
+- easy to find primary reading or writing for a term
+- efficiently stores kana-only words
+- allows the parser to parse alternatively written words (currently requires
+  manual typescript intervention to resolve the `alt` field back to the actual
+  term to get its tags)
+
+negatives:
+- ~creates duplicates in `text` column for readings of terms with different
+ kanji but the same reading~
+
+ I consider this a non-issue because this simplifies the sentence lookup
+ query. The alternative (a term\<-\>reading/writing reference table) would
+ save disk space in exchange for processing time and complexity.
+- ~unclear how to reference a specific word without using its `term_id` (which
+ can vary from user to user when different dictionaries are installed), or
+ *what identifies a unique term in this case?*~
+
+ `user.sort_overlay` needs to be able to uniquely identify a `term_id`, but
+  also needs to be im-/exportable by users with different imported dictionaries
+ (ideally with minimal failure).
+
+  options to consider:
+ - ~just use (primary) writing only~
+
+    this doesn't work for terms whose multiple readings distinguish between
+    meanings, e.g.
+ <ruby>人気<rt>ひとけ</rt></ruby>/<ruby>人気<rt>にんき</rt></ruby>
+ - ~identify as "term with text X and another text Y"~
+
+    this feels like a janky solution, but it's what is currently being used:
+    X is always the default way of writing and Y the default reading
+ - directly reference `term_id` in `user.sort_overlay` and convert to matching
+ all known readings/writings at time of export/import
+
+ good:
+
+ - faster `user.sort_overlay` lookups
+ - still allows user preference import/exporting
+
+ bad:
+
+ - ~all data in `user.db` becomes useless when `dict.db` is lost or corrupted~
+
+ `user.sort_overlay` will be moved to `dict.db`, and `user.db` will only
+ be used for storing (mostly portable) user preferences and identifiers
+ (username, id, etc.).
+ - importing/exporting will take longer and require more complicated sql code
+
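The export/import conversion described in the last option might be sketched like this (column names mirror the example table below; the function names and matching rule are assumptions):

```typescript
// Sketch: reference a term across databases by its primary writing +
// primary reading pair instead of its local term_id.
// Rows mirror the readwritings example table (bit 0 = primary writing,
// bit 1 = primary reading).
interface ReadWriting { term_id: number; text: string; flags: number; }

const PRIMARY_WRITING = 1 << 0;
const PRIMARY_READING = 1 << 1;

const table: ReadWriting[] = [
  { term_id: 2, text: "変える", flags: PRIMARY_WRITING },
  { term_id: 2, text: "かえる", flags: PRIMARY_READING },
  { term_id: 3, text: "帰る", flags: PRIMARY_WRITING },
  { term_id: 3, text: "かえる", flags: PRIMARY_READING },
];

// export: term_id -> portable (writing, reading) pair
function exportTerm(termId: number) {
  const rows = table.filter((r) => r.term_id === termId);
  return {
    writing: rows.find((r) => r.flags & PRIMARY_WRITING)?.text,
    reading: rows.find((r) => r.flags & PRIMARY_READING)?.text,
  };
}

// import: resolve a (writing, reading) pair back to the local term_id
function importTerm(writing: string, reading: string): number | undefined {
  const ids = new Set(
    table
      .filter((r) => r.text === writing && r.flags & PRIMARY_WRITING)
      .map((r) => r.term_id),
  );
  return table.find(
    (r) => r.text === reading && r.flags & PRIMARY_READING && ids.has(r.term_id),
  )?.term_id;
}
```

Note that the reading alone ("かえる") is ambiguous between 変える and 帰る; requiring the writing to match as well resolves it.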
+
+### example tables
+
+#### readwritings (should have a better name)
+
+(indexes from LSB)
+
+- `flags[0]` = primary writing
+- `flags[1]` = primary reading
+
+|`id`|`term_id`|`text`|`flags`|
+|-|-|-|-|
+|1|1|繰り返す|1|
+|2|1|くり返す|0|
+|3|1|繰返す|0|
+|4|1|繰りかえす|0|
+|5|1|くりかえす|2|
+|6|2|変える|1|
+|7|2|かえる|2|
+|8|3|帰る|1|
+|9|3|かえる|2|
+|10|4|にやにや|3|
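
As a sanity check on the encoding: the kana-only にやにや row has flags = 3, i.e. it is both the primary writing and the primary reading at once. A minimal decode sketch:

```typescript
// Decode the readwritings flags bitfield (bit 0 = primary writing,
// bit 1 = primary reading, indexed from the LSB).
function decodeFlags(flags: number) {
  return {
    primaryWriting: (flags & 0b01) !== 0,
    primaryReading: (flags & 0b10) !== 0,
  };
}
```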
+
+# how/where to deal with irregular readings
+
+WIP
+
+ideally one way of storing reading exceptions for:
+
+- 来る + する (conjugation-dependent)
+- 入る as (はいる) or (いる) (not sure about this one?)
+- counters (counter type + amount specific)
+- numbers (exceptions for certain powers of 10)
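
For the counter case, one possible shape is an exception table keyed on counter + amount, falling back to the regular reading when no exception exists. The example readings (1本 = いっぽん etc.) are standard Japanese, but the storage format here is just a guess:

```typescript
// Sketch: reading exceptions for counters, keyed on counter + amount.
// Regular readings fall through; exceptions override them.
const counterExceptions = new Map<string, string>([
  ["本:1", "いっぽん"], // the regular pattern would give いちほん
  ["本:3", "さんぼん"],
  ["本:6", "ろっぽん"],
]);

function counterReading(counter: string, amount: number, regular: string): string {
  return counterExceptions.get(`${counter}:${amount}`) ?? regular;
}
```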
+
+# way to link expressions to a list of conjugated terms
+
+WIP
+
+this may be useful for dictionaries that provide meanings for expressions but
+don't provide readings for those expressions? (新和英大辞典 has some of these)
+
+examples:
+- 村長選 -> 村長 + 選[suf]
+- 花より団子 -> 花 + より[grammar] + 団子
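
The link could be stored as an ordered list of (term, tag) parts, which also yields the expression's reading by concatenating the components' readings. The readings below are standard; the data structure itself is an assumption:

```typescript
// Sketch: an expression links to ordered component terms; its reading is
// the concatenation of the components' readings.
interface ExpressionPart { term: string; reading: string; tag?: string; }

const expressions = new Map<string, ExpressionPart[]>([
  ["村長選", [
    { term: "村長", reading: "そんちょう" },
    { term: "選", reading: "せん", tag: "suf" },
  ]],
  ["花より団子", [
    { term: "花", reading: "はな" },
    { term: "より", reading: "より", tag: "grammar" },
    { term: "団子", reading: "だんご" },
  ]],
]);

function expressionReading(expr: string): string | undefined {
  return expressions.get(expr)?.map((p) => p.reading).join("");
}
```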
+
+# random thoughts
+
+this project has 0 planning so here's a list of things that may eventually need
+some thought
+
+- how can a series-specific dictionary also 'encourage' the use of another
+ domain-specific category? (e.g. anime about programming makes computer domain
+ specific terms rank slightly higher or something?)
+- maybe have a mode in the front-end that captures preedit text from a user
+ typing japanese text to infer readings of kanji, or rank different terms
+ slightly higher? (using [compositionupdate
+ events](https://developer.mozilla.org/en-US/docs/Web/API/Element/compositionupdate_event))
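
In the browser this would hang off `compositionupdate`/`compositionend` on the input element; the pairing logic below is separated out so it can run without a DOM (the kana-only heuristic for spotting preedit readings is a rough assumption):

```typescript
// Sketch: remember the last kana-only preedit string seen during IME
// composition and pair it with the committed text as a reading hint.
// In the browser this would be fed from composition events, e.g.:
//   input.addEventListener("compositionupdate", (e) => tracker.update(e.data));
//   input.addEventListener("compositionend", (e) => tracker.commit(e.data));
class PreeditTracker {
  private lastPreedit = "";
  readonly hints: { committed: string; reading: string }[] = [];

  update(data: string) {
    // kana-only preedit strings are candidate readings; once the IME shows
    // kanji candidates, keep the last kana state instead
    if (/^[ぁ-んァ-ンー]*$/.test(data)) this.lastPreedit = data;
  }

  commit(committed: string) {
    if (this.lastPreedit && committed !== this.lastPreedit) {
      this.hints.push({ committed, reading: this.lastPreedit });
    }
    this.lastPreedit = "";
  }
}
```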
+