Vocabulary

In which order are words added?

There are essentially two ways in which Lugamun’s vocabulary is growing:

1. New concepts are chosen for addition by the algorithm.
2. New concepts are added because they are needed for some specific reason, say to translate a text, to express a sentence, or to expand the grammar.

Method (1) was used exclusively for the first 100 words of the language. Since then, both (1) and (2) have been used side by side: sometimes the algorithm selects concepts for addition, and sometimes they are specified explicitly in order to fill a particular gap in the dictionary.

What do we mean by concept here? Essentially, a specific unit of meaning that can become a word in Lugamun. Wiktionary, which provides much of the data used to create Lugamun’s vocabulary, distinguishes concepts in two steps. It first groups the different meanings of a word by word class (adjective, noun, verb etc.). It then lists the specific meanings (often more than one) for each word/class pair and collects translations separately for each meaning.

For example, the English word water is used both as a noun and as a verb. For the noun, Wiktionary gives several definitions – among them “clear liquid H₂O” and “body of water, or specific part of it” – and then lists translations into other languages for each of these meanings. By concept we mean here such a combination of an English word with one specific sense definition, for example water (clear liquid H₂O).

Automatic selection of the next word to add

The algorithm Lugamun uses for selecting words is somewhat state-dependent: words from a source language whose current influence is low get a higher chance of being selected, and vice versa. Therefore the order in which words are selected matters to some degree.

But where to start – which words to add first? Intuitively, it makes sense to start with words that are particularly fundamental and widespread. But how to formalize this?

Since Lugamun’s algorithm relies on translations listed in Wiktionary, an initial idea was to start with concepts that Wiktionary documents in a large number of languages. So, before proposing a sorted list of candidate words for a given concept, the algorithm first decides which concept should be added next, starting with the concepts that have the highest number of translations into other languages in Wiktionary.

The concept with the highest number of translations is water (clear liquid H₂O), for which Wiktionary lists translations into more than 3000 languages. This was indeed the first word added to Lugamun, though the chosen word was later revised and replaced.

One problem with following only translation counts, however, is that most of the words with a very high number of translations are nouns. To avoid creating a core vocabulary made up of lots of nouns and not much else, the words in Wiktionary are sorted into three groups:

1. nouns
2. adjectives and adverbs
3. verbs and all other word classes (numerals, pronouns etc.)

The word selection algorithm proceeds in such a way as to ensure that these three groups are equally represented in the dictionary. Since the first word added was a noun, the second word must come from group (2) or (3). Among these, the numeral un ‘one’ has the highest number of translations, so it was added second – now it is the oldest word in Lugamun’s dictionary that has survived without later changes. This word belongs to group (3), hence the third word had to be an adjective or adverb – among these, bai ‘white’ had the highest number of translations and was added next. After that, the algorithm was again free to add a word from any of the groups, since all three were now evenly distributed. While the process continues, the algorithm always ensures that one third of the core vocabulary comes from each of the three groups.
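
The balancing rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lugamun’s actual code: the function names, the group labels and the translation counts in SAMPLE are all made up for the example (only the order water, one, white follows the text).

```python
from collections import Counter

# Hypothetical labels for the three groups described above.
GROUPS = ("noun", "adj_adv", "other")


def pick_next(candidates, counts):
    """Pick the next concept to add.

    candidates: list of (concept, group, translation_count) not yet added.
    counts: Counter of how many words each group already has.
    """
    fewest = min(counts[g] for g in GROUPS)
    # Only groups that are not ahead of the others may receive the next word.
    open_groups = {g for g in GROUPS if counts[g] == fewest}
    eligible = [c for c in candidates if c[1] in open_groups]
    # Assumed fallback: if the open groups are exhausted, take any candidate.
    return max(eligible or candidates, key=lambda c: c[2])


def selection_order(candidates, n):
    """Return the first n concepts in the order they would be selected."""
    remaining = list(candidates)
    counts = Counter()
    order = []
    for _ in range(n):
        choice = pick_next(remaining, counts)
        remaining.remove(choice)
        counts[choice[1]] += 1
        order.append(choice[0])
    return order


# Made-up translation counts, for illustration only.
SAMPLE = [
    ("water", "noun", 3000),
    ("sun", "noun", 2500),
    ("one", "other", 2000),
    ("two", "other", 1900),
    ("white", "adj_adv", 1500),
]

print(selection_order(SAMPLE, 4))
```

With these sample numbers, the noun sun has to wait until one and white have been added, even though its count is higher than theirs.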

Custom selection of a word to add

The alternative is that a manually selected concept is added to the language because it’s needed for some particular reason. If this is the case, the reason for adding that concept at that time is always documented in the selection log. Entries for such words always start with “Processing entry … as requested, rationale:” – and then the reason for the addition follows.

One common reason is that words are needed in order to express other words that were automatically selected for addition by the algorithm. For example, the concept today (on the current day) is expressed as si den in Lugamun, so the words si ‘this, these’ and den ‘day’ had to be found and added first before it could be expressed and added as well.

The other common reason is that words are needed in order to express some specific content in Lugamun. For example, to translate the fable The North Wind and the Sun, the word norte ‘north, northern’ and many others first had to be found and added to the dictionary.

Can a word in Lugamun represent several concepts?

Yes. XXX Explain how the polysemy check works and how it is used. Also explain that Lugamun’s word formation rules ensure that many words represent several concepts, e.g. all verbs can also be used as nouns.

How does the word selection algorithm work?

Lugamun largely relies on an algorithm that converts the corresponding words from all our source languages into a form that fits our phonology and then ranks them into an order from “best fit” to “worst fit”. For the details of how that ordering works, see Vocabulary selection for a worldlang.

XXX Integrate that article into the wiki and update it as needed (see also file reddit/vocabulary.md). Explain that words are now ranked first by the number of related candidates, only then by penalty.

This algorithmic order is only a proposal; the ultimate decision on which word to add is made by a human. Often, but not always, the chosen word is indeed the best candidate as determined by the algorithm. The rankings determined by the algorithm and the ultimate choice made are always documented in the selection log. If the chosen word was not the candidate ranked first, then the rationale for that choice is always stated in the selection log. For example, for the word un ‘one, first’, the log states:

Candidate #3 “un” added to the dictionary on 2021-07-19 16:01:05.
Selection rationale: We prefer this form over ‘una/uno’ (#1/2), because it’s not gendered, shorter, and more international (shared by two languages). It’s also related to them.

How are words made into candidates?

Before the words from the different source languages can be compared with each other and ranked, they are first converted into “candidate” form – the form they would have if actually accepted into Lugamun. This form thus corresponds to Lugamun’s phonetic rules and spelling system.

Here are a few examples for one of Lugamun’s oldest words, representing the concept white (bright and colourless):

The Arabic male form أَبْيَض‎ (ʾabyaḍ) becomes abyade. The final -e is needed because Lugamun’s syllables are not allowed to end in d. The female form بَيْضَاء‎ (bayḍāʾ) becomes baida.
The Chinese word 白 (bái) becomes bai. This was the best-ranked candidate and actually chosen for integration into Lugamun.
The English word white (IPA: /waɪt/) becomes wait.
French blanc becomes blanke – again an -e is added because Lugamun’s phonology doesn’t allow syllable-final consonant clusters.
Indonesian putih becomes puti – syllable-final -h is considered less essential for the sound of a word, therefore it is dropped instead of adding -e behind it.
Russian бе́лый (bélyj) becomes beli.
The candidates from the other source languages (Spanish, Hindi, Japanese, Swahili) are converted into candidates in a similar manner.
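
The word-final adjustments seen in these examples (adding -e, dropping -h) can be sketched as below. This is a simplified illustration, not the real conversion code: it assumes the input has already been romanized and reduced to Lugamun’s sounds, and the set ALLOWED_FINALS is an assumption that contains only consonants seen at the end of words quoted on this page, so it is probably incomplete.

```python
VOWELS = set("aeiou")

# Assumption: consonants that may end a word, inferred only from forms
# quoted on this page (tem, un, abar, wait). Probably incomplete.
ALLOWED_FINALS = set("mnrt")


def fit_ending(word, allowed_finals=ALLOWED_FINALS):
    """Adjust the end of an already romanized word (hypothetical helper)."""
    # Final -h is dropped instead of being followed by -e.
    if word.endswith("h"):
        word = word[:-1]
    if not word or word[-1] in VOWELS:
        return word
    # Two consonants in a row at the end count as a cluster.
    ends_in_cluster = len(word) >= 2 and word[-2] not in VOWELS
    if ends_in_cluster or word[-1] not in allowed_finals:
        return word + "e"
    return word


for source in ["abyad", "blank", "putih", "bai", "wait"]:
    print(source, "->", fit_ending(source))
```

The inputs abyad and blank are simplified stand-ins for the Arabic and French source words, with the other sound conversions assumed to have happened already.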

The candidate form of any word is automatically created by Lugamun’s vocabulary-building algorithm. The algorithm uses the word’s original spelling and the sounds it represents as a starting point; if the word is from a language that doesn’t use the Latin alphabet, it is based on the representation in the romanization scheme used by Wiktionary for that language. For English words, the sound of the word as represented in the IPA is used as starting point, since in the case of English the connection between written and spoken form can be fairly loose.

Generally speaking, the idea is to create candidate words that are close to the original pronunciation, while fitting Lugamun’s phonology. To allow adding words from all source languages in a convenient form, there are some rules that apply to only one or a few languages. The most important of these are:

Generally, nasalized vowels are handled by adding -n after the vowel. But in the case of French, we use either -n or -m, following the written form. In the case of /ɑ̃/, we also use e instead of a if that corresponds to the original spelling. Hence temps becomes tem rather than tan. This results in a greater similarity to the written form and also to the corresponding forms in other languages, which are often closer to the written form (such as Spanish tiempo, Portuguese tempo, English time etc.).
Though Japanese has strictly speaking no diphthongs, we convert the vowel combinations ai, au, oi to our diphthongs. Phonetically, the difference is small, and this allows for a smoother integration of Japanese words containing these vowel combinations (which would otherwise require an apostrophe or some other changes such as inserting a semivowel between them).
For Mandarin, the plosives (stops) are adapted following their pinyin spellings, hence pinyin b, d, g are preserved as such instead of changing them to the voiceless equivalents p, t, k. In this way, Mandarin’s distinction of unaspirated and aspirated plosives becomes a distinction of voiced and voiceless plosives.

When generating candidates for verbs, we use the infinitive as starting point, if one exists, otherwise the customary “dictionary form” is used. Typical markers of the infinitive or base form are stripped, when this can be done without making the verb less recognizable. Specifically, this means:

Arabic verbs in their dictionary form usually end in -a. Especially in long verbs with three (or more) syllables, this vowel often changes in conjugated forms, therefore it contributes little to making the word recognizable. Hence we strip it in such cases if the result is allowed by Lugamun’s phonology. Thus عَبَرَ‎ (ʿabara) becomes abar, while عَجِبَ (ʿajiba) remains ajiba (since r is allowed to end a syllable, but b isn’t).
Final -r is stripped from Spanish and French verbs: Spanish comer yields the candidate kome; French écrire yields ekri. In the case of French verbs ending in -er, the e is stripped as well, if Lugamun’s phonology allows this, since final -e is silent in French. Hence opiner becomes opin.
The prefix meng- (and its variants) is stripped from Indonesian verbs.
In Swahili, the infinitive prefix ku- is stripped, except in the case of short verbs with just two syllables, since these tend to preserve the prefix in many conjugated forms. Hence kula remains unchanged, but kuona yields the candidate ona.
Japanese verbs generally end in -u, therefore this final vowel contributes little to making the word recognizable. Hence we strip it from long verbs with three (or more) syllables if the result is allowed by Lugamun’s phonology. Thus 腐る (kusaru) becomes kusar. In shorter verbs, the final vowel is preserved, since the resulting word would otherwise have only a single syllable, and preserving the second vowel reduces the number of collisions with other words.
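
Some of the stripping rules above share a common shape, which the following sketch tries to capture for the Swahili, Arabic and Japanese cases. It is an illustration under several assumptions rather than the actual implementation: the helper names are invented, syllables are approximated by counting vowel letters (which miscounts diphthongs), and ALLOWED_FINALS lists only consonants that end words quoted on this page.

```python
VOWELS = set("aeiou")

# Assumption: consonants that may end a word, inferred only from forms
# quoted on this page. Probably incomplete.
ALLOWED_FINALS = set("mnrt")


def syllables(word):
    """Approximate the syllable count by counting vowel letters."""
    return sum(ch in VOWELS for ch in word)


def strip_swahili(verb):
    """Strip the infinitive prefix ku-, except from two-syllable verbs."""
    if verb.startswith("ku") and syllables(verb) > 2:
        return verb[2:]
    return verb


def strip_final_vowel(verb, vowel, allowed_finals=ALLOWED_FINALS):
    """Shared shape of the Arabic (-a) and Japanese (-u) rules.

    The vowel is stripped only from verbs with three or more syllables,
    and only if the remaining stem ends in a single allowed consonant.
    """
    if not verb.endswith(vowel) or syllables(verb) < 3:
        return verb
    stem = verb[:-1]
    single_final = stem[-1] in allowed_finals and stem[-2] in VOWELS
    return stem if single_final else verb


print(strip_swahili("kula"), strip_swahili("kuona"))
print(strip_final_vowel("abara", "a"), strip_final_vowel("ajiba", "a"))
print(strip_final_vowel("kusaru", "u"))
```

The inputs are the romanized forms given in the list, with the other sound conversions (such as dropping the initial ʿ in the Arabic examples) assumed to have happened already.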

How are distortion penalties calculated?

A distortion penalty is applied when the pronunciation of a candidate word has to be changed significantly compared to the original word, because a sound used in the original doesn’t exist in Lugamun or a certain sound combination is not allowed by Lugamun’s phonology.

Generally, certain cases result in “one penalty” or “one penalty point” being applied, while others are not considered sufficiently distorting to deserve a penalty. More specifically:

Changing a single vowel into Lugamun’s nearest vowel sound doesn’t incur a penalty; e.g. /y/ in French words becomes u without a penalty.
Reducing a diphthong to a single vowel incurs a penalty; e.g. English name, pronounced /neɪm/, becomes nem with a penalty. However, there are certain cases where the original vowel may typically be perceived as a single vowel rather than a diphthong by the speakers of the source language. In such cases, no penalty applies. One example is English /oʊ/ as in goat, which becomes o without a penalty.
Nasalized vowels are converted to the nearest vowel in Lugamun followed by n, without a penalty.
No penalty is applied when converting a consonant to another, as long as both are derived from the same basic letter in IPA. For example, the IPA representations of all rhotic consonants are derived from the letter r, therefore they are all converted to r without a penalty.
Otherwise a penalty is applied when a consonantal sound doesn’t exist in Lugamun and must be converted to the nearest consonant that does exist.
Usually /ŋ/ is changed to n without a penalty, following the rules described above. However, in the case of Mandarin, a penalty is applied, since /n/ and /ŋ/ are essentially the only consonantal endings allowed there, which makes merging them a fairly severe change.
A penalty is applied when dropping a consonant or adding a vowel (always e) for phonotactic reasons.

To be considered eligible for selection, a word can have at most one penalty applied to it. Those with two or more penalties are automatically skipped when sorting candidates. Originally it was just a rule of thumb that such more severely distorted words would be skipped, but it has since become an inherent part of the candidate generation process.
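
The penalty rules and the eligibility limit can be summarized in a small sketch. It is illustrative only: the change labels, the table layout and the function names are assumptions made for this example, and the two placeholder candidates are not real words from the selection log.

```python
# Penalty points per kind of change, following the list above.
# The labels are hypothetical names chosen for this sketch.
PENALTY = {
    "vowel_to_nearest": 0,           # e.g. French /y/ becomes u
    "diphthong_reduced": 1,          # e.g. English /neɪm/ becomes nem
    "perceived_monophthong": 0,      # e.g. English /oʊ/ becomes o
    "nasal_vowel_to_vowel_n": 0,
    "consonant_same_ipa_letter": 0,  # e.g. any rhotic becomes r
    "consonant_to_nearest": 1,
    "ng_to_n": 0,
    "ng_to_n_mandarin": 1,
    "consonant_dropped": 1,
    "vowel_added": 1,                # the added vowel is always e
}

# A candidate with more than one penalty is skipped.
MAX_PENALTY = 1


def penalty(changes):
    """Sum the penalty points for a list of change labels."""
    return sum(PENALTY[change] for change in changes)


def eligible(candidates):
    """Keep the candidates whose total penalty is at most MAX_PENALTY."""
    return [word for word, changes in candidates
            if penalty(changes) <= MAX_PENALTY]


SAMPLE = [
    ("nem", ["diphthong_reduced"]),
    ("placeholder-a", ["vowel_to_nearest", "consonant_same_ipa_letter"]),
    ("placeholder-b", ["consonant_to_nearest", "vowel_added"]),
]

print(eligible(SAMPLE))
```

Here placeholder-b collects two penalty points and is therefore skipped, while nem stays eligible with a single point.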

How are compounds and derived words chosen?

XXX Explain.
