# Russian Hyphenation Rules

This document defines the rule set for the Russian hyphenation plugin. The implementation should return the original word with soft hyphens inserted at preferred transfer points. Tests may render soft hyphens as `-` for readability.

The goal is not perfect dictionary hyphenation. The goal is a deterministic, readable rule set that avoids invalid transfers and produces better break points than the current vowel/consonant heuristic.

## Character Classes

Lowercase each character before classifying it.

- Vowels: `а е ё и о у ы э ю я`
- Consonants: Russian letters except vowels, `ь`, `ъ`, and `й`
- Non-syllabic letters: `ь ъ`
- Semivowel: `й`

Only hyphenate words made entirely of Russian letters. Leave mixed words, words with digits, words with existing soft hyphens, and abbreviations unchanged. Short words are hyphenated only when they pass the same legal side-length and vowel filters as longer words.

## Hard Legal Filters

A candidate break is illegal if any of these is true:

- Either side would have fewer than 2 letters.
- Either side would contain no vowel.
- The right side starts with `ь`, `ъ`, `й`, or `ы`.
- The break falls immediately before `ь` or `ъ`; that is, do not split `под-ъезд` or `бол-ьшой`. Prefer `подъ-езд`, `боль-шой`.
- The break falls immediately before `й`; that is, do not split `ма-йор` or `во-йна`. Prefer `май-ор`, `вой-на`.
- The break separates a consonant from a following vowel: reject `люб-овь`, `паст-ух`, `реб-ята`. Prefer `лю-бовь`, `па-стух` or `пас-тух`, `ре-бята` or `ребя-та`.

## Candidate Generation

Work between adjacent vowel nuclei. For each span from one vowel to the next, choose preferred break candidates from the consonant cluster between them.

### Adjacent Vowels

If two vowels are adjacent, allow a break between them when both resulting parts pass the legal filters.

Examples:

- `поэт` -> `по-эт`
- `академия` -> `ака-де-мия` and not `а-кадемия` or `академи-я`

### One Consonant Between Vowels

For `V C V`, break before the consonant.

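
A minimal sketch of the `V C V` rule in isolation, assuming Python and a word that is already known to contain only Russian letters. The names `vcv_breaks` and `render` are hypothetical and not part of the plugin; only the side-length filter is applied here, because the vowel filter is satisfied automatically when a single consonant sits between two vowels.

```python
VOWELS = set("аеёиоуыэюя")
CONSONANTS = set("бвгджзклмнпрстфхцчшщ")  # excludes й, ь, ъ per Character Classes
SOFT_HYPHEN = "\u00ad"


def vcv_breaks(word: str) -> list[int]:
    """Return indices where a break goes before a single consonant between vowels."""
    lower = word.lower()
    n = len(lower)
    breaks = []
    for i in range(1, n - 1):
        single_consonant = (
            lower[i] in CONSONANTS
            and lower[i - 1] in VOWELS
            and lower[i + 1] in VOWELS
        )
        # Side-length filter: each side of the break needs at least 2 letters.
        if single_consonant and i >= 2 and n - i >= 2:
            breaks.append(i)
    return breaks


def render(word: str, breaks: list[int], mark: str = SOFT_HYPHEN) -> str:
    """Insert `mark` at every break index (indices refer to the original word)."""
    parts = []
    prev = 0
    for b in breaks:
        parts.append(word[prev:b])
        prev = b
    parts.append(word[prev:])
    return mark.join(parts)


if __name__ == "__main__":
    for w in ("молоко", "корова", "переход"):
        print(w, "->", render(w, vcv_breaks(w), "-"))
```

Passing `"-"` as the mark mirrors how tests may render soft hyphens for readability.
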

Examples:

- `молоко` -> `мо-ло-ко`
- `корова` -> `ко-ро-ва`
- `переход` -> `пе-ре-ход`

### Two Consonants Between Vowels

For `V C C V`, prefer a break between the consonants.

Examples:

- `лампа` -> `лам-па`
- `гордый` -> `гор-дый`
- `письмо` -> `пись-мо`

If the cluster contains `й`, `ь`, or `ъ`, keep that letter on the left and break after it when legal.

Examples:

- `майор` -> `май-ор`
- `подъезд` -> `подъ-езд`
- `большой` -> `боль-шой`

### Three Or More Consonants Between Vowels

For longer clusters, prefer the latest break that still leaves a pronounceable left part, but keep common inseparable starts on the right:

- Keep `ст`, `ск`, `сп`, `сн`, `сл`, `см`, `св` together on the right when possible.
- Keep stop/liquid pairs together on the right when possible: `бр`, `бл`, `вр`, `вл`, `гр`, `гл`, `др`, `тр`, `кр`, `кл`, `пр`, `пл`, `фр`, `фл`.
- Otherwise prefer splitting before the last consonant in the cluster.

Examples:

- `сестра` -> `се-стра`
- `острый` should not use `о-стрый`, because the left side is too short
- `родство` -> `род-ство`
- `чувство` -> `чув-ство`
- `предложение` -> `пред-ло-же-ние`

## Double Consonants

When two identical consonants stand between vowels, prefer splitting between them.

Examples:

- `масса` -> `мас-са`
- `длинный` -> `длин-ный`
- `касса` -> `кас-са`

Do not force this rule when the double consonant starts a root after a prefix. Without a dictionary, this exception is hard to detect, so implementation may leave such words to the general cluster logic.

## Prefix-Like Boundaries

Without a morphology dictionary, treat these as preferred heuristics only.


If a word starts with a common prefix and the following part is legal, prefer a break after the prefix:

- `без`, `бес`, `воз`, `вос`, `вз`, `вс`, `из`, `ис`, `низ`, `нис`, `раз`, `рас`, `роз`, `рос`, `от`, `об`, `объ`, `под`, `подъ`, `пред`, `пере`, `при`, `про`, `над`, `сверх`, `меж`

Examples:

- `подбить` -> `под-бить`
- `размах` -> `раз-мах`
- `предложение` -> `пред-ло-же-ние`
- `подъезд` -> `подъ-езд`

Do not create a right side starting with `ы`; prefer later legal breaks.

Examples:

- `разыскать` -> `ра-зыскать` or `разыс-кать`, not `раз-ыскать`
- `розыгрыш` -> `ро-зыгрыш` or `розыг-рыш`, not `роз-ыгрыш`

## Ranking

A word may have several legal break points. The plugin should insert all good breaks, but it should avoid noisy low-quality breaks.

Use this ranking:

1. Prefix boundary, if legal.
2. Double consonant split between vowels.
3. Syllable breaks from adjacent vowel and consonant-cluster rules.
4. Longer-cluster fallback break before the last consonant.

Reject candidates that are legal but awkward when a better candidate is within one character and both candidates divide the same vowel-to-vowel span.

## Example Expectations

These strings use `-` where the implementation will insert `SoftHyphen`.

```text
молоко -> мо-ло-ко
корова -> ко-ро-ва
яблоко -> яб-ло-ко
повествование -> по-вест-во-ва-ние
предложение -> пред-ло-же-ние
компьютер -> компью-тер
подъезд -> подъ-езд
большой -> боль-шой
майор -> май-ор
масса -> мас-са
длинный -> длин-ный
разыскать -> ра-зыс-кать
розыгрыш -> ро-зыг-рыш
```

## Non-Goals

- Full dictionary-level hyphenation.
- Stress-aware syllabification.
- Exact morpheme detection for every prefix/root boundary.
- Hyphenating proper abbreviations and mixed-script technical identifiers.
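
As a rough illustration of the Hard Legal Filters section, the checks could be sketched in Python as below. The function names `is_legal_break` and `legal_breaks` are hypothetical, and the sketch assumes the word has already been verified to contain only Russian letters. It covers only the legality filters, not candidate generation or ranking.

```python
VOWELS = set("аеёиоуыэюя")
CONSONANTS = set("бвгджзклмнпрстфхцчшщ")  # excludes й, ь, ъ per Character Classes
FORBIDDEN_RIGHT_START = set("ьъйы")


def has_vowel(part: str) -> bool:
    return any(ch in VOWELS for ch in part)


def is_legal_break(word: str, i: int) -> bool:
    """Check a break between word[:i] and word[i:] against the hard filters."""
    lower = word.lower()
    left, right = lower[:i], lower[i:]
    # Either side would have fewer than 2 letters.
    if len(left) < 2 or len(right) < 2:
        return False
    # Either side would contain no vowel.
    if not has_vowel(left) or not has_vowel(right):
        return False
    # The right side starts with ь, ъ, й, or ы (this also covers
    # "the break falls immediately before ь, ъ, or й").
    if right[0] in FORBIDDEN_RIGHT_START:
        return False
    # The break separates a consonant from a following vowel.
    if left[-1] in CONSONANTS and right[0] in VOWELS:
        return False
    return True


def legal_breaks(word: str) -> list[int]:
    """All break indices that survive the hard filters (no ranking applied)."""
    return [i for i in range(1, len(word)) if is_legal_break(word, i)]


if __name__ == "__main__":
    print(is_legal_break("подъезд", 3))  # под-ъезд
    print(is_legal_break("подъезд", 4))  # подъ-езд
    print(legal_breaks("майор"))
```

A list of legal indices is only the input to the ranking step; a legal break is not necessarily a preferred one.
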