Diffstat (limited to 'docs/development')
-rw-r--r-- | docs/development/language-features.md | 93 |
1 files changed, 79 insertions, 14 deletions
diff --git a/docs/development/language-features.md b/docs/development/language-features.md
index 7ce44a8a..799a6776 100644
--- a/docs/development/language-features.md
+++ b/docs/development/language-features.md
@@ -41,7 +41,7 @@ That's it! Your language should now be selectable from the dropdown, and may wor
 
 You should first have the repo set up locally according to the instructions in the [contributing guidelines](../../CONTRIBUTING.md).
 
-A language descriptor in `language-descriptors.js` has several optional fields for more advanced features. We've already mentioned `textPreprocessors`, but there are also `languageTransforms`, `textPostprocessors`, and `isTextLookupWorthy`. Let's go through them (see also the full type definition in `language-descriptors.d.ts`).
+A language descriptor in `language-descriptors.js` has several optional fields for more advanced features. We've already mentioned `textPreprocessors`, but there are also `languageTransforms`, `textPostprocessors`, `isTextLookupWorthy`, and `readingNormalizer`. Let's go through them (see also the full type definition in `language-descriptors.d.ts`).
 
 ### Text Preprocessors
 
@@ -133,7 +133,7 @@ Deinflection is the process of converting a word to its base or dictionary form.
 
 These grammatical rules are located in files such as `english-transforms.js`.
 
-> Not all the grammatical rules of a language can or need to be implemented in the transforms file. Even a little bit goes a long way, and you can always add more rules later. For every couple rules you add, write some tests in the respective file (see `english-transforms.test.js`), and run `npm run test:unit`. This will help you verify that your rules are correct, and make sure nothing is accidentally broken along the way.
+> Not all the grammatical rules of a language can or need to be implemented in the transforms file. Even a little bit goes a long way, and you can always add more rules later.
For every couple rules you add, write some tests in the respective file ([see the writing tests section below](#writing-deinflection-tests)). This will help you verify that your rules are correct, and make sure nothing is accidentally broken along the way.
 
 Transforms files should export a `LanguageTransformDescriptor`, which is then imported in `language-descriptors.js`.
 
@@ -142,12 +142,14 @@ Transforms files should export a `LanguageTransformDescriptor`, which is then im
 export type LanguageTransformDescriptor = {
   language: string;
   conditions: ConditionMapObject;
-  transforms: Transform[];
+  transforms: {
+    [name: string]: Transform;
+  };
 };
 ```
 
 - `language` is the ISO code of the language
-- `conditions` are an array of parts of speech, and grammatical forms that are used to check which deinflections make sense. They are referenced by the deinflection rules.
+- `conditions` are an object containing parts of speech and grammatical forms that are used to check which deinflections make sense. They are referenced by the deinflection rules.
 - `transforms` are the actual deinflection rules
 
 Let's try and write a bit of deinflection for English, from scratch.
@@ -159,17 +161,17 @@ import { suffixInflection } from "../language-transforms.js";
 export const englishTransforms = {
   language: "en",
   conditions: {},
-  transforms: [
-    {
+  transforms: {
+    plural: {
       name: "plural",
       description: "Plural form of a noun",
       rules: [suffixInflection("s", "", [], [])],
     },
-  ],
+  },
 };
 ```
 
-This is a simple example for English, where the only deinflection rule is to remove the "s" from the end of a noun to get the singular form. The `suffixInflection` function is a helper that creates a deinflection rule for a suffix. It takes the suffix to remove, what to replace it with, and two more parameters for conditions, which we will look at next. The `suffixInflection` is the most common type of deinflection rule across languages.
+This is a simple example for English, where the only deinflection rule is to remove the "s" from the end of a noun to get the singular form. The `suffixInflection` function is a helper that creates a deinflection rule for a suffix. It takes the suffix to remove, what to replace it with, and two more parameters for conditions, which we will look at next. The `suffixInflection` is the most common type of deinflection rule across languages. The inner `plural` (the `name` field) is what is displayed during lookup, and the outer `plural` (the object key) is a name referenced only internally within the file.
 
 For the input string "cats", the following strings will be looked up:
 
@@ -187,15 +189,15 @@ conditions: {
     isDictionaryForm: true,
   },
 },
-transforms: [
-  {
+transforms: {
+  "plural": {
     name: "plural",
     description: "Plural form of a noun",
     rules: [
       suffixInflection("s", "", [], ["n"])
     ],
   },
-],
+},
 ```
 
 Now, only dictionary entries marked with the same "n" condition will be eligible for matching the `plural` rule. The verb "read" should be marked as "v" in the dictionary, and will no longer be matched by the `plural` rule. The entries in the dictionary need to be marked with the exact same conditions defined in the `conditions` object. The `isDictionaryForm` field can be set to `false`, to allow some conditions to be used only in between rules, and not in the dictionary. In most cases however, it will be set to `true`.
@@ -244,21 +246,80 @@ conditions: {
     isDictionaryForm: true,
   },
 },
-transforms: [
-  {
+transforms: {
+  "plural": {
     name: "plural",
     description: "Plural form of a noun",
     rules: [
       suffixInflection("s", "", ["np"], ["ns"])
     ],
   },
-],
+},
 ```
 
 Since `ns` and `np` are subconditions of `n` they will both match with `n`, but not with each other. This covers all of the requirements we have considered.
 
 The `suffixInflection` is one of a few helper functions - you can write more complex rules, using regex and a function for deinflecting.
There are examples of this across the language transforms files.
 
+#### Writing Deinflection Tests
+
+Now that you have added a couple deinflection rules, you might want to start writing some tests to check if the deinflections are behaving correctly. Let's say we wanted to test the behavior of our `plural` and `possessive` rules, and even the two combined. Our test file should look like this:
+
+```js
+// english-transforms.test.js
+import { englishTransforms } from "../../ext/js/language/en/english-transforms.js";
+import { LanguageTransformer } from "../../ext/js/language/language-transformer.js";
+import { testLanguageTransformer } from "../fixtures/language-transformer-test.js";
+
+const tests = [
+  {
+    category: "plurals and possessive",
+    valid: true,
+    tests: [
+      { term: "cat", source: "cats", rule: "ns", reasons: ["plural"] },
+      { term: "cat", source: "cat's", rule: "ns", reasons: ["possessive"] },
+      {
+        term: "cat",
+        source: "cats'",
+        rule: "ns",
+        reasons: ["plural", "possessive"],
+      },
+    ],
+  },
+];
+
+const languageTransformer = new LanguageTransformer();
+languageTransformer.addDescriptor(englishTransforms);
+testLanguageTransformer(languageTransformer, tests);
+```
+
+The part we want to examine is the `tests` array. The other things are common across all test files.
+
+- `term` is the final form of the deinflected word.
+- `source` is the source word to be deinflected to `term`.
+- `rule` is the final condition of `term`. Here, we used `ns` because `cat` is a singular noun.
+- `reasons` represents the chain of deinflection rules needed to get from `source` to `term`.
+
+You can check that all the tests pass by running `npm run test:unit`.
+
+> This command runs all Yomitan unit test files. To only run a single test file, you can instead opt for `npx vitest <file-name>`.
+
+Now, we may want to verify that `boss` really does not deinflect to `bo`.
You can add to the `tests` array:
+
+```js
+{
+    category: 'invalid deinflections',
+    valid: false,
+    tests: [
+        {term: 'bo', source: 'boss', rule: 'ns', reasons: ['plural', 'plural']},
+    ],
+},
+```
+
+Here, by setting `valid` to `false`, we are telling the test function to fail this test case only if `boss` deinflects to `bo` with the `ns` condition under a double `plural` rule.
+
+You can also optionally pass a `preprocess` helper function to `testLanguageTransformer`. Refer to the language transforms test files for its specific use case.
+
 ### Text Postprocessors
 
 In special cases, text may need to be modified after deinflection. These work exactly like text preprocessors, but are applied after deinflection. Currently, this is only used for Korean, where the Hangul text is disassembled into jamo during preprocessing, and so must be reassembled after deinflection.
@@ -267,6 +328,10 @@ In special cases, text may need to be modified after deinflection. These work ex
 
 Some features include checking whether a string is possibly a word in the language. For example, trying to look up, in an English dictionary, a word written with non-Latin characters (e.g. "日本語") will never yield any results. To prevent unnecessary lookups, an `isTextLookupWorthy` function can be provided, otherwise all text will be looked up.
 
+### Reading Normalizers
+
+In certain languages, dictionary entries may contain readings as a key to read words, e.g. Kana for Japanese and Pinyin for Chinese. Sometimes, dictionaries may be inconsistent in how they store these readings, leading to the word entries often being split when looked up even though they share the same reading. In these cases, you can use a `readingNormalizer` function to normalize the readings to a common format.
+
 ## Stuck?
 
 If you have any questions, please feel free to open a Discussion on GitHub, or find us on [TheMoeWay Discord](https://discord.gg/UGNPMDE7zC) at [#yomitan-development](https://discord.com/channels/617136488840429598/1081538711742844980).
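
As a rough illustration of the condition matching that the patched documentation walks through, the sketch below is a self-contained toy model in plain JavaScript. It is not Yomitan's `LanguageTransformer`: the helpers `toySuffixInflection`, `expand`, `overlaps`, `deinflect`, and `lookup`, the two-entry dictionary, and the `subConditions` layout of the `conditions` object are all assumptions made for illustration. It only tries to show why the rule `suffixInflection("s", "", ["np"], ["ns"])` lets "cats" reach "cat" while keeping "boss" from reaching "bo".

```javascript
// Toy model only: every name below is hypothetical, not Yomitan's API.

// Assumed condition layout: `n` has the subconditions `np` and `ns`.
const conditions = {
  v: { name: "Verb", isDictionaryForm: true },
  n: { name: "Noun", isDictionaryForm: true, subConditions: ["np", "ns"] },
  np: { name: "Noun plural", isDictionaryForm: true },
  ns: { name: "Noun singular", isDictionaryForm: true },
};

// Stand-in for the real helper: it just records the rule's four parameters.
function toySuffixInflection(suffix, replacement, conditionsIn, conditionsOut) {
  return { suffix, replacement, conditionsIn, conditionsOut };
}

const transforms = {
  plural: {
    name: "plural",
    description: "Plural form of a noun",
    rules: [toySuffixInflection("s", "", ["np"], ["ns"])],
  },
};

// A condition together with all of its (nested) subconditions.
function expand(condition) {
  const subs = conditions[condition]?.subConditions ?? [];
  return [condition, ...subs.flatMap((sub) => expand(sub))];
}

// Two condition lists match when their expansions share at least one member.
function overlaps(listA, listB) {
  const expandedA = new Set(listA.flatMap((c) => expand(c)));
  return listB.flatMap((c) => expand(c)).some((c) => expandedA.has(c));
}

// Collect the source text plus every form reachable by chaining rules.
// `state` is null for the untouched source (no constraint yet); afterwards
// it holds the conditionsOut of the last rule that was applied.
function deinflect(source) {
  const results = [{ text: source, state: null, reasons: [] }];
  for (let i = 0; i < results.length; i++) {
    const { text, state, reasons } = results[i];
    for (const transform of Object.values(transforms)) {
      for (const rule of transform.rules) {
        if (!text.endsWith(rule.suffix)) continue;
        if (state !== null && !overlaps(state, rule.conditionsIn)) continue;
        const stem = text.slice(0, text.length - rule.suffix.length);
        results.push({
          text: stem + rule.replacement,
          state: rule.conditionsOut,
          reasons: [transform.name, ...reasons],
        });
      }
    }
  }
  return results;
}

// Hypothetical two-entry dictionary, tagged with conditions defined above.
const dictionary = [
  { term: "cat", condition: "n" },
  { term: "read", condition: "v" },
];

// Keep only candidates whose text and conditions match a dictionary entry.
function lookup(source) {
  return deinflect(source).flatMap(({ text, state, reasons }) =>
    dictionary
      .filter((entry) => entry.term === text)
      .filter((entry) => state === null || overlaps(state, [entry.condition]))
      .map((entry) => ({ term: entry.term, reasons }))
  );
}

console.log(lookup("cats")); // one match: "cat" via ["plural"]
console.log(lookup("reads")); // no match: "read" is tagged "v", not a noun
console.log(deinflect("boss").map((r) => r.text)); // "boss" and "bos", never "bo"
```

In this model the second `plural` step on "bos" is rejected because the first step left the word in state `ns`, which does not overlap the rule's `np` input condition. That is the same situation the `valid: false` test case is meant to guard.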