Linguistic Influence on Multidimensional Word Embeddings: Analysis of Ten Languages
Article
Citation information for this article was obtained from Scopus.
The article is published in a journal indexed in Web of Science and/or Scopus.
Date of the last search for the article in external sources: March 4, 2026.
Abstract
Understanding how linguistic typology shapes multilingual embeddings is important for cross-lingual NLP. We examine static MUSE word embeddings for ten diverse languages (English, Russian, Chinese, Arabic, Indonesian, German, Lithuanian, Hindi, Tajik, and Persian). Using pairwise cosine distances, Random Forest classification, and UMAP visualization, we find that language identity and script type largely determine how embeddings cluster, while morphological complexity affects cluster compactness and lexical overlap connects clusters. The Random Forest model predicts language labels with high accuracy (≈98%), indicating strong language-specific patterns in the embedding space. These results highlight script, morphology, and lexicon as key factors influencing the structure of multilingual embeddings, and they can inform the linguistically aware design of cross-lingual models.
Keywords: word embeddings; MUSE; random forest; UMAP; cross-lingual NLP; language typology; multilingual data
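
The pairwise cosine distance comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy 3-dimensional vectors, the language names `lang_a` and `lang_b`, and the helper `mean_cross_distance` are all hypothetical and stand in for real MUSE vectors, which the abstract does not reproduce.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_cross_distance(vectors_a, vectors_b):
    """Average cosine distance over every pair (a, b), a from A and b from B.

    When A and B are the same list, self-pairs (distance 0) are included,
    so the result is a rough compactness score, not a strict estimate.
    """
    pairs = list(product(vectors_a, vectors_b))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy vectors invented for illustration only (assumption, not real data).
toy_embeddings = {
    "lang_a": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "lang_b": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}

within = mean_cross_distance(toy_embeddings["lang_a"], toy_embeddings["lang_a"])
between = mean_cross_distance(toy_embeddings["lang_a"], toy_embeddings["lang_b"])
print(f"within lang_a: {within:.3f}, lang_a vs lang_b: {between:.3f}")
```

In this toy setup the within-language distance is smaller than the between-language distance, which is the kind of pattern that would correspond to separate language clusters. With real embeddings, a vectorized library routine would be a more practical choice than these pure-Python loops.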