Source languages

What are Lugamun's source languages and why?

Lugamun has ten source languages: Arabic, English, French, Hindustani (Hindi/Urdu), Indonesian/Malay, Japanese, Mandarin Chinese, Russian, Spanish, and Swahili. Five of these languages belong to the Indo-European family, the language family that has by far the highest number of speakers (it’s spoken by about 40% of the world population). The other five represent five other particularly widely spoken language families.

The exact method for selecting these languages was as follows:

For the Indo-European languages – by far the most widely spoken language family in the world – we select the biggest language from each subfamily (or branch), provided that it has at least 100 million speakers (a threshold of 50 million would give the same result). This results in four source languages: English (Germanic branch), Hindustani (Hindi/Urdu, Indo-Iranian branch), Spanish (Italic branch), and Russian (Balto-Slavic branch).
For each of the four next biggest language families (all of which have more than 300 million speakers in total), we use the most widely spoken language: Mandarin Chinese (Sino-Tibetan family), Swahili (Niger-Congo family), Standard Arabic (Afroasiatic family), and Indonesian (Austronesian family).
We also add French (the second most widely spoken Italic language), since it is one of the official languages of the United Nations – the only official language not yet in our list. French vies with Bengali in being the most widely spoken language not yet in our list – but it is arguably more international, being an official language in more than 30 countries (the second highest number after English), while Bengali is official only in Bangladesh and parts of India.
To avoid having more Indo-European than other languages and to increase diversity, we also add the most widely spoken language from a family not yet represented: Japanese (Japonic family).
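
The first step of this method (the biggest language per branch, subject to a minimum number of speakers) can be sketched in a few lines of Python. This is only an illustration, not the procedure as it was actually carried out: the names Language and largest_per_branch are hypothetical, and the toy data uses made-up placeholder languages and figures rather than real speaker statistics.

```python
from typing import NamedTuple


class Language(NamedTuple):
    name: str
    branch: str
    speakers_millions: float  # placeholder figures, NOT real statistics


def largest_per_branch(languages, min_speakers_millions=100):
    """Return the names of the biggest language of each branch,
    keeping only those that reach the speaker threshold."""
    best = {}
    for lang in languages:
        current = best.get(lang.branch)
        if current is None or lang.speakers_millions > current.speakers_millions:
            best[lang.branch] = lang
    return sorted(
        lang.name
        for lang in best.values()
        if lang.speakers_millions >= min_speakers_millions
    )


# Made-up toy data: three branches, one of them too small to qualify.
toy = [
    Language("A1", "branch A", 900),
    Language("A2", "branch A", 130),
    Language("B1", "branch B", 250),
    Language("C1", "branch C", 40),
]

selected = largest_per_branch(toy)
```

In this toy data, a threshold of 50 million selects the same languages as a threshold of 100 million, because the only branch that fails the test lies below both values.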

This choice was made in July 2021.

Sources:

Ethnologue: What are the largest language families?
Reddit: The world's 30 most widely spoken languages (Mar 2021)
Wikipedia: List of language families
Wikipedia: List of languages by the number of countries in which they are recognized as an official language
Wikipedia articles on language families and individual languages
Worldometer: Current World Population

Are different source languages treated differently?

Yes. A distinction is made between the five most widely spoken languages (the “top 5”) and the other five source languages (the “next 5”). A candidate word must be from one of the top 5 languages or it must have a related candidate in another language to be eligible for selection.

This means that words from the “next 5” (French, Russian, Indonesian/Malay, Japanese, and Swahili) are not considered candidate words unless they have a related word (a true or false cognate) in any of the other nine source languages. For example, the word to ‘that’ is based on Japanese と (to) and related to Russian что (što) – without this related candidate, it would not have been eligible for selection and hence could not have made it into the dictionary.

On the other hand, candidates from the “top 5” (English, Mandarin Chinese, Hindustani, Arabic, and Spanish) are eligible for selection even if they don’t have any related candidate. For example, tvi ‘leg’ is from Chinese 腿 (tuǐ); there are no related (similar) words in any of the other source languages.
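
The eligibility rule described above can be written as a small check. This is a minimal sketch, not Lugamun's actual code: the function name is_eligible and the way related candidates are represented (as a set of language names) are assumptions made for illustration.

```python
TOP_5 = {"English", "Mandarin Chinese", "Hindustani", "Arabic", "Spanish"}
NEXT_5 = {"French", "Russian", "Indonesian/Malay", "Japanese", "Swahili"}


def is_eligible(language, related_languages):
    """Return True if a candidate word may be selected.

    language          -- source language of the candidate
    related_languages -- set of OTHER source languages that have a related
                         (true or false cognate) candidate; a hypothetical
                         representation chosen for this sketch
    """
    return language in TOP_5 or len(related_languages) > 0


# The two examples from the text, plus an unsupported "next 5" candidate.
to_eligible = is_eligible("Japanese", {"Russian"})     # to 'that'
tvi_eligible = is_eligible("Mandarin Chinese", set())  # tvi 'leg'
unsupported = is_eligible("Swahili", set())
```

Here to_eligible and tvi_eligible are both True, while unsupported is False: a Swahili word with no related candidate is not considered.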

All candidate words are sorted first by the number of related candidates and only then by their total penalty, which means that words that have at least one related candidate will always be preferred over those that have none. Hence the candidates from the “top 5” languages without any related candidates will be placed at the end of the candidate list, after all candidates that do have related candidates. So they can be considered as “choices of last resort” that are only considered if no candidate word has (true or false) cognates in other source languages.
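
This two-level ordering corresponds to sorting on a pair of keys. The following sketch assumes that a lower penalty is better; the Candidate type, the rank function, and the sample words and penalty values are all hypothetical and not taken from Lugamun's selection tool.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    word: str
    related_count: int  # number of related candidates in other languages
    penalty: float      # total penalty; assumed here: lower is better


def rank(candidates):
    """Sort by number of related candidates (most first), then by penalty."""
    return sorted(candidates, key=lambda c: (-c.related_count, c.penalty))


# Made-up sample data for illustration only.
sample = [
    Candidate("w1", 0, 0.5),  # top 5 word without related candidates
    Candidate("w2", 2, 3.0),
    Candidate("w3", 1, 1.0),
    Candidate("w4", 2, 2.0),
]

ranked_words = [c.word for c in rank(sample)]
```

Even though w1 has the lowest penalty in this sample, it ends up last, since every candidate with at least one related candidate is placed before it.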

The reason for limiting these “choices of last resort” to the top 5 languages is that it helps to ensure that all of Lugamun’s words will be recognizable to a considerable number of people. Even without related candidates, many people will recognize words from English or Mandarin – many more than would recognize a word from Japanese or Swahili. This is why only the former, but not the latter, become “choices of last resort.”

In cases where no word has any related candidates in other languages, words from the top 5 languages will therefore always be chosen, since only they are eligible. In all other cases, words from any of the source languages may be chosen, based on their overall penalty.

Nevertheless, the fact that the top 5 languages are sometimes privileged gives them a visible advantage in the influence statistics, where the top 5 (plus French) all have a higher influence than the other four. French, though not a top 5 language and so never yielding “choices of last resort,” manages to retain a very high influence because its words are often related to the words used in other source languages (especially Spanish and English, but also Russian and even Indonesian). On the other hand, Mandarin quite rarely shares words with any other source language, and since Lugamun’s selection algorithm always prefers shared words, its influence is lower than that of any other top 5 language.

(In a few exceptional cases, words from the next 5 languages may be chosen even without the support of a related word. But this is only the case in the rare situation that none of the normally eligible candidates is suitable, so that the usual selection criteria need to be relaxed. If this is the case, the exceptional choice is always explained and justified in the selection log. One example where this is the case is the optional object preposition o (from Japanese), which was accepted because none of the top 5 languages has an object marker/preposition that could replace it.)

Why do you consider Hindustani a single language?

Because most linguists do so. Hindi and Urdu are listed as separate languages in the dictionary because they use different writing systems, but that doesn’t mean they should be treated as different languages for the purposes of generating candidate words – rather, doing so would just yield identical duplicate candidates for nearly all words. It would be similar to considering American English and British English or Brazilian Portuguese and Portuguese Portuguese as different languages – yes, there are differences, but they are so small that it makes more sense to consider them as variants of the same language rather than as different languages.
