Differences

This shows you the differences between two versions of the page.

--- en:background:vocabulary [2022-04-20 12:52] – [Automatic selection of the next word to add] christian
+++ en:background:vocabulary [2022-11-08 12:04] (current) – [Custom selection of a word to add] Restore original spelling of link christian
@@ Line 36: / Line 36: @@
 ==== Custom selection of a word to add ====
-The alternative is that a manually selected word is added to the language because it's needed for some particular reason. If this is the case, the reason for adding that word at that time is always documented in the [[https://gitlab.com/ChristianSi/lugamun/-/blob/main/data/selectionlog.txt?expanded=true&viewer=simple|selection log]]. Entries for such words always start with "Processing entry ... as requested, rationale:" – and then the reason for the addition follows.
+The alternative is that a manually selected concept is added to the language because it's needed for some particular reason. If this is the case, the reason for adding that concept at that time is always documented in the [[https://gitlab.com/ChristianSi/lugamun/-/blob/main/data/selectionlog.txt?expanded=true&viewer=simple|selection log]]. Entries for such words always start with "Processing entry ... as requested, rationale:" – and then the reason for the addition follows.
-One common reason is that words are required to express other words that were automatically selected by addition by the algorithm. For example, the word 'today' is expressed as **si den** in Lugamun, so the words **si** 'this, these' and **den** 'day' had to be found and added first so that it could be added.
+One common reason is that words are needed in order to express other words that were automatically selected for addition by the algorithm. For example, the concept //today (on the current day)// is expressed as **si den** in Lugamun, so the words **si** 'this, these' and **den** 'day' had to be found and added before it could be expressed and added as well.
-The other common reason is that words are required to express some specific content in Lugamun. For example, to translate the fable [[trans:fen norte e sol|The North Wind and the Sun]], the words **norte** 'north, northern' and many others had first to be found and added to the dictionary.
+The other common reason is that words are needed in order to express some specific content in Lugamun. For example, to translate the fable [[trans:Fen Norte va Sol|The North Wind and the Sun]], the word **norte** 'north, northern' and many others first had to be found and added to the dictionary.
 ===== Can a word in Lugamun represent several concepts? =====
-Yes. XXX Explain now the polysemy check works and how it is used. Also Lugamun's word formation rules ensure that many words represent several concepts, since e.g. all verbs can also be used as nouns.
+Yes. XXX Explain how the polysemy check works and how it is used. Also explain that Lugamun's word formation rules ensure that many words represent several concepts, e.g. all verbs can also be used as nouns.
+===== How does the word selection algorithm work? =====
+Lugamun largely relies on an algorithm that converts the corresponding words from all our [[source languages]] into a form that fits our [[en:grammar:phonology and spelling|phonology]] and then ranks them from "best fit" to "worst fit". For the details of how that ordering works, see [[https://www.reddit.com/r/auxlangs/comments/mlf8h8/vocabulary_selection_for_a_worldlang/|Vocabulary selection for a worldlang]].
+XXX Integrate that article into the wiki and update it as needed (see also file reddit/vocabulary.md). Explain that words are now ranked first by the number of related candidates, only then by penalty.
+This algorithmic order is only a proposal; the ultimate decision on which word to add is made by a human. Often, but not always, the chosen word is indeed the best candidate as determined by the algorithm. The rankings determined by the algorithm and the ultimate choice made are always documented in the [[https://gitlab.com/ChristianSi/lugamun/-/blob/main/data/selectionlog.txt?expanded=true&viewer=simple|selection log]]. If the chosen word was **not** the candidate ranked first, then the rationale for that choice is always stated in the selection log. For example, for the word **un** 'one, first', the log states:
+> Candidate #3 "un" added to the dictionary on 2021-07-19 16:01:05.
+> Selection rationale: We prefer this form over 'una/uno' (#1/2), because it's not gendered, shorter, and more international (shared by two languages). It's also related to them.
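The ordering step can be pictured with a minimal sketch. It assumes only what the note above states: candidates are compared first by their number of related candidates and only then by penalty. The ''Candidate'' class, its field names, and the sample values are hypothetical placeholders and are not taken from Lugamun's actual code or selection log.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    form: str      # candidate spelling (placeholder values below)
    related: int   # number of related candidates from other source languages
    penalty: int   # distortion penalty points


def rank(candidates):
    """Order candidates from best fit to worst fit.

    Assumption: more related candidates is better, and among equals
    a lower penalty is better.
    """
    return sorted(candidates, key=lambda c: (-c.related, c.penalty))


# Invented placeholder data, for illustration only.
pool = [
    Candidate("alfa", related=0, penalty=0),
    Candidate("beta", related=1, penalty=1),
    Candidate("gama", related=1, penalty=0),
]

for position, cand in enumerate(rank(pool), start=1):
    print(position, cand.form)
```

The result is only a proposed order; as described above, a human still makes the final choice.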
+===== How are words made into candidates? =====
+Before the words from the different source languages can be compared with each other and ranked, they are first converted into "candidate" form – the form they would have if actually accepted into Lugamun. This form thus corresponds to Lugamun's [[en:grammar:phonology and spelling|phonetic rules and spelling system]].
+Here are a few examples for one of Lugamun's oldest words, representing the concept //[[https://en.wiktionary.org/wiki/white/translations#Adjective|white (bright and colourless)]]//:
+  * The Arabic male form أَبْيَض‎ (ʾabyaḍ) becomes //abyade//. The final //-e// is needed because Lugamun's syllables are not allowed to end in //d//. The female form بَيْضَاء‎ (bayḍāʾ) becomes //baida//.
+  * The Chinese word 白 (bái) becomes //bai//. This was the best-ranked candidate and actually chosen for integration into Lugamun.
+  * The English word //white// (IPA: /waɪt/) becomes //wait//.
+  * French //blanc// becomes //blanke// – again an //-e// is added because Lugamun's phonology doesn't allow syllable-final consonant clusters.
+  * Indonesian //putih// becomes //puti// – syllable-final //-h// is considered less essential for the sound of a word, therefore it is dropped instead of adding //-e// behind it.
+  * Russian бе́лый (bélyj) becomes //beli//.
+  * The candidates from the other source languages (Spanish, Hindi, Japanese, Swahili) are converted into candidates in a similar manner.
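The word-final adjustments in the examples above can be sketched as follows. This is not Lugamun's actual code: the function name is made up, the inputs are assumed to be already converted to Lugamun's sounds, and the set of consonants allowed at the end of a syllable is an assumption inferred only from examples on this page (//wait//, //abar//, //opin//, //tem//).

```python
VOWELS = set("aeiou")

# Assumption: consonants that may end a syllable. Inferred from examples on
# this page only; the real set in Lugamun's phonology may differ.
ALLOWED_FINALS = set("mnrt")


def fit_word_end(word, allowed_finals=ALLOWED_FINALS):
    """Adapt the end of a word to the (assumed) syllable-final rules."""
    if not word or word[-1] in VOWELS:
        return word                      # already ends in a vowel
    if word[-1] == "h":
        return word[:-1]                 # syllable-final -h is dropped
    ends_in_cluster = len(word) >= 2 and word[-2] not in VOWELS
    if ends_in_cluster or word[-1] not in allowed_finals:
        return word + "e"                # add -e after a disallowed ending
    return word


for raw in ["abyad", "bai", "wait", "blank", "putih"]:
    print(raw, "->", fit_word_end(raw))
```

Only the end of the word is handled here; syllable boundaries inside a word would need the same kind of check.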
+The candidate form of any word is automatically created by Lugamun's vocabulary-building algorithm. The algorithm uses the word's original spelling and the sounds it represents as a starting point; if the word is from a language that doesn't use the Latin alphabet, it is based on the representation in the romanization scheme used by Wiktionary for that language. For English words, the sound of the word as represented in the [[wp>International Phonetic Alphabet|IPA]] is used as the starting point, since in the case of English the connection between written and spoken form can be fairly loose.
+Generally speaking, the idea is to create candidate words that are close to the original pronunciation, while fitting Lugamun's phonology. To allow adding words from all source languages in a convenient form, there are some rules that apply only to one or a few languages. The most important of these are:
+  * Generally, nasalized vowels are handled by adding //-n// after the vowel. But in the case of French, we use either //-n// or //-m//, following the written form. In the case of /ɑ̃/, we also use //e// instead of //a// if that corresponds to the original spelling. Hence //temps// becomes //tem// rather than //tan//. This results in a greater similarity to the written form and also to the corresponding forms in other languages, which often are closer to the written form (such as Spanish //tiempo//, Portuguese //tempo//, English //time// etc.).
+  * Though Japanese has strictly speaking no diphthongs, we convert the vowel combinations //ai, au, oi// to our diphthongs. Phonetically, the difference is small, and this allows for a smoother integration of Japanese words containing these vowel combinations (which would otherwise require an apostrophe or some other changes such as inserting a semivowel between them).
+  * For Mandarin, the plosives (stops) are adapted following their pinyin spellings, hence pinyin //b, d, g// are preserved as such instead of changing them to the voiceless equivalents //p, t, k//. In this way, Mandarin's distinction of unaspirated and aspirated plosives becomes a distinction of voiced and voiceless plosives.
+When generating candidates for verbs, we use the infinitive as the starting point if one exists; otherwise the customary "dictionary form" is used. Typical markers of the infinitive or base form are stripped when this can be done without making the verb less recognizable. Specifically, this means:
+  * Arabic verbs in their dictionary form usually end in //-a//. Especially in long verbs with three (or more) syllables, this vowel often changes in conjugated forms, therefore it contributes little to making the word recognizable. Hence we strip it in such cases if the result is allowed by Lugamun's phonology. Thus عَبَرَ‎ (ʿabara) becomes //abar//, while عَجِبَ (ʿajiba) remains //ajiba// (since //r// is allowed to end a syllable, but //b// isn't).
+  * Final //-r// is stripped from Spanish and French verbs: Spanish //comer// yields the candidate //kome//; French //écrire// yields //ekri//. In the case of French verbs ending in //-er//, the //e// is stripped as well, if Lugamun's phonology allows this, since final //-e// is silent in French. Hence //opiner// becomes //opin//.
+  * The prefix //meng-// (and its variants) is stripped from Indonesian verbs.
+  * In Swahili, the infinitive prefix //ku-// is stripped, except in the case of short verbs with just two syllables, since these tend to preserve the prefix in many conjugated forms. Hence //kula// remains unchanged, but //kuona// yields the candidate //ona//.
+  * Japanese verbs generally end in //-u//, therefore this final vowel contributes little to making the word recognizable. Hence we strip it from long verbs with three (or more) syllables if the result is allowed by Lugamun's phonology. Thus 腐る (kusaru) becomes //kusar//. In shorter verbs, the final vowel is preserved, since the resulting word would otherwise have only a single syllable, and preserving the second vowel reduces the number of collisions with other words.
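Some of the stripping rules above (Swahili //ku-//, Arabic //-a//, Japanese //-u//) can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names are hypothetical, inputs are plain romanized forms, syllables are approximated by counting vowel letters, and the set of allowed syllable-final consonants is inferred from examples on this page rather than taken from Lugamun's phonology.

```python
VOWELS = set("aeiou")

# Assumption: consonants that may end a syllable (inferred from this page's
# examples; the real set may differ).
ALLOWED_FINALS = set("mnrt")


def syllable_count(word):
    """Rough approximation: one syllable per vowel letter."""
    return sum(ch in VOWELS for ch in word)


def strip_swahili_ku(verb):
    """Strip the infinitive prefix ku-, except from two-syllable verbs."""
    if verb.startswith("ku") and syllable_count(verb) > 2:
        return verb[2:]
    return verb


def strip_final_vowel(verb, vowel, allowed_finals=ALLOWED_FINALS):
    """Strip a final -a (Arabic) or -u (Japanese) from verbs with three or
    more syllables, if the result ends in a single allowed consonant."""
    if (
        verb.endswith(vowel)
        and syllable_count(verb) >= 3
        and verb[-2] in allowed_finals
        and verb[-3] in VOWELS
    ):
        return verb[:-1]
    return verb


print(strip_swahili_ku("kuona"), strip_swahili_ku("kula"))
print(strip_final_vowel("abara", "a"), strip_final_vowel("ajiba", "a"))
print(strip_final_vowel("kusaru", "u"))
```

The Spanish, French, and Indonesian rules are omitted here because they interact with spelling conversion (for example //c// to //k//) and prefix variants that this page doesn't spell out.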
+===== How are distortion penalties calculated? =====
+A distortion penalty is applied when Lugamun's pronunciation of a candidate word has to be changed significantly compared to the original word, because a sound used in the original doesn't exist in Lugamun or a certain sound combination is not allowed by Lugamun's phonology.
+Generally, certain cases result in "one penalty" or "one penalty point" being applied, while others are not considered sufficiently distorting to deserve a penalty. More specifically:
+  * Changing a single vowel into Lugamun's nearest vowel sound doesn't incur a penalty; e.g. /y/ in French words becomes **u** without a penalty.
+  * Reducing a diphthong to a single vowel incurs a penalty; e.g. English //name//, pronounced /neɪm/, becomes //nem// with a penalty. However, there are certain cases where the original vowel may typically be perceived as a single vowel rather than a diphthong by the speakers of the source language. In such cases, no penalty applies. One example is English /oʊ/ as in //goat//, which becomes //o// without a penalty.
+  * Nasalized vowels are converted to the nearest vowel in Lugamun followed by //n//, without a penalty.
+  * No penalty is applied when converting a consonant to another, as long as both are derived from the same basic letter in IPA. For example, the IPA representations of all rhotic consonants are derived from the letter //r//, therefore they are all converted to //r// without a penalty.
+  * Otherwise a penalty is applied when a consonantal sound doesn't exist in Lugamun and must be converted to the nearest consonant that does exist.
+  * Usually /ŋ/ is changed to //n// without a penalty, following the rules described above. However, in the case of Mandarin, a penalty is applied, since /n/ and /ŋ/ are essentially the only consonantal endings allowed there, which makes merging them a fairly severe change.
+  * A penalty is applied when dropping a consonant or adding a vowel (always //e//) for phonotactic reasons.
+To be considered eligible for selection, a word can have at most //one// penalty applied to it. Those with two or more penalties are automatically skipped when sorting candidates. Originally it was just a rule of thumb that such more severely distorted words would be skipped, but it has since become an inherent part of the candidate generation process.
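The penalty rules and the eligibility cutoff can be summarized in a small sketch. The change labels, the table name, and the function names below are hypothetical and chosen only for illustration; the point values follow the list above, with one point per penalized change and a maximum of one point for an eligible candidate.

```python
# Penalty points per kind of change, following the rules listed above.
# The labels themselves are made up for this sketch.
PENALTY_POINTS = {
    "vowel_to_nearest": 0,
    "diphthong_reduced": 1,
    "diphthong_perceived_as_single_vowel": 0,
    "nasal_vowel_to_vowel_plus_n": 0,
    "consonant_same_ipa_base_letter": 0,
    "consonant_to_nearest": 1,
    "ng_to_n": 0,
    "ng_to_n_mandarin": 1,
    "consonant_dropped": 1,
    "vowel_added": 1,
}

MAX_PENALTY = 1  # candidates with two or more penalties are skipped


def total_penalty(changes):
    """Sum the penalty points for a list of change labels."""
    return sum(PENALTY_POINTS[change] for change in changes)


def is_eligible(changes):
    """A candidate is eligible if it has at most one penalty point."""
    return total_penalty(changes) <= MAX_PENALTY


# English "name" /neɪm/ -> "nem": one diphthong reduced, one penalty.
print(is_eligible(["diphthong_reduced"]))
# A hypothetical word needing both a consonant substitution and an added -e.
print(is_eligible(["consonant_to_nearest", "vowel_added"]))
```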
+===== How are compounds and derived words chosen? =====
+XXX Explain.