# Russian Hyphenation Rules

This document defines the rule set for the Russian hyphenation plugin. The
implementation should return the original word with soft hyphens inserted at
preferred transfer points. Tests may render soft hyphens as `-` for readability.

The goal is not perfect dictionary hyphenation. The goal is a deterministic,
readable rule set that avoids invalid transfers and produces better break points
than the current vowel/consonant heuristic.

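
As a minimal sketch of that output contract, the helpers below insert U+00AD at
given break indices and render the result with `-` for tests. The names
`SOFT_HYPHEN`, `insert_soft_hyphens`, and `render_for_tests` are illustrative
assumptions, not the plugin's actual API.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN (constant name is an assumption)


def insert_soft_hyphens(word: str, breaks: list[int]) -> str:
    """Insert a soft hyphen before each index in `breaks`.

    Indices refer to positions in the original word, so 2 means
    "between word[:2] and word[2:]".
    """
    pieces = []
    start = 0
    for index in sorted(set(breaks)):
        pieces.append(word[start:index])
        start = index
    pieces.append(word[start:])
    return SOFT_HYPHEN.join(pieces)


def render_for_tests(text: str) -> str:
    """Replace soft hyphens with a visible `-` so expectations stay readable."""
    return text.replace(SOFT_HYPHEN, "-")


print(render_for_tests(insert_soft_hyphens("молоко", [2, 4])))  # мо-ло-ко
```

Removing every soft hyphen from the result should always give back the original
word, which makes a cheap invariant for tests.
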
## Character Classes

Lowercase each character before classifying it.

- Vowels: `а е ё и о у ы э ю я`
- Consonants: Russian letters except vowels, `ь`, `ъ`, and `й`
- Non-syllabic letters: `ь ъ`
- Semivowel: `й`

Only hyphenate words made entirely of Russian letters. Leave mixed words, words
with digits, words with existing soft hyphens, and abbreviations unchanged. Short
words are hyphenated only when they pass the same legal side-length and vowel
filters as longer words.

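
The classes above could be encoded roughly as follows. This is a sketch: the
names are hypothetical, and treating any all-uppercase word of two or more
letters as an abbreviation is an assumption, since this document does not define
how abbreviations are detected.

```python
VOWELS = frozenset("аеёиоуыэюя")
NON_SYLLABIC = frozenset("ьъ")
SEMIVOWEL = frozenset("й")
# а..я covers 32 letters; ё sits outside that range in Unicode, so add it explicitly.
RUSSIAN_LETTERS = frozenset(chr(c) for c in range(ord("а"), ord("я") + 1)) | {"ё"}
CONSONANTS = RUSSIAN_LETTERS - VOWELS - NON_SYLLABIC - SEMIVOWEL


def classify(ch: str) -> str:
    """Classify one character after lowercasing it."""
    ch = ch.lower()
    if ch in VOWELS:
        return "vowel"
    if ch in SEMIVOWEL:
        return "semivowel"
    if ch in NON_SYLLABIC:
        return "non_syllabic"
    if ch in CONSONANTS:
        return "consonant"
    return "other"


def is_candidate_word(word: str) -> bool:
    """True when the word may be hyphenated at all."""
    if not word or any(ch.lower() not in RUSSIAN_LETTERS for ch in word):
        return False  # empty, mixed script, digits, or an existing soft hyphen
    # Assumption: all-caps words of 2+ letters are abbreviations.
    return not (len(word) > 1 and word.isupper())
```

A soft hyphen is not a Russian letter, so words that already contain one are
rejected by the same membership check as digits and Latin letters.
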
## Hard Legal Filters

A candidate break is illegal if any of these is true:

- Either side would have fewer than 2 letters.
- Either side would contain no vowel.
- The right side starts with `ь`, `ъ`, `й`, or `ы`.
- The left side ends before `ь` or `ъ`; that is, do not split `под-ъезд` or
  `бол-ьшой`. Prefer `подъ-езд`, `боль-шой`.
- The left side ends before `й`; that is, do not split `ма-йор` or `во-йна`.
  Prefer `май-ор`, `вой-на`.
- The break separates a consonant from a following vowel: reject `люб-овь`,
  `паст-ух`, `реб-ята`. Prefer `лю-бовь`, `па-стух` or `пас-тух`, `ре-бята`
  or `ребя-та`.

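
The filters can be sketched as a single predicate over a word and a break index.
The function names are hypothetical; a break at index `i` means splitting into
`word[:i]` and `word[i:]`.

```python
VOWELS = frozenset("аеёиоуыэюя")
CONSONANTS = frozenset("бвгджзклмнпрстфхцчшщ")
BAD_RIGHT_START = frozenset("ьъйы")


def is_legal_break(word: str, i: int) -> bool:
    """Return False if splitting word[:i] | word[i:] violates a hard filter."""
    w = word.lower()
    left, right = w[:i], w[i:]
    if len(left) < 2 or len(right) < 2:
        return False  # a side is too short
    if not (VOWELS & set(left)) or not (VOWELS & set(right)):
        return False  # a side has no vowel
    if right[0] in BAD_RIGHT_START:
        return False  # covers "ends before ь/ъ/й" and the ы rule
    if left[-1] in CONSONANTS and right[0] in VOWELS:
        return False  # consonant separated from the following vowel
    return True


def legal_breaks(word: str) -> list[int]:
    """All break indices that survive the hard filters."""
    return [i for i in range(1, len(word)) if is_legal_break(word, i)]


print(legal_breaks("молоко"))  # [2, 4]
```

Note that the "left side ends before `ь`, `ъ`, or `й`" filters are the same
condition as "the right side starts with" those letters, so one check handles
both.
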
## Candidate Generation

Work between adjacent vowel nuclei. For each span from one vowel to the next,
choose preferred break candidates from the consonant cluster between them.

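
One way to enumerate those spans is sketched below. `vowel_spans` is a
hypothetical helper name; it returns the indices of two neighboring vowels and
the letters between them.

```python
VOWELS = frozenset("аеёиоуыэюя")


def vowel_spans(word: str) -> list[tuple[int, int, str]]:
    """Return (left_vowel_index, right_vowel_index, cluster) for each pair of
    neighboring vowel nuclei. The cluster may be empty (adjacent vowels)."""
    w = word.lower()
    nuclei = [i for i, ch in enumerate(w) if ch in VOWELS]
    return [(a, b, w[a + 1:b]) for a, b in zip(nuclei, nuclei[1:])]


print(vowel_spans("сестра"))  # [(1, 5, 'стр')]
print(vowel_spans("молоко"))  # [(1, 3, 'л'), (3, 5, 'к')]
```

Letters before the first vowel and after the last vowel never appear in a span,
so they cannot produce a candidate on their own.
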
### Adjacent Vowels

If two vowels are adjacent, allow a break between them when both resulting parts
pass the legal filters.

Examples:

- `поэт` -> `по-эт`
- `академия` -> `ака-де-мия` and not `а-кадемия` or `академи-я`, because each
  of those leaves a one-letter side

### One Consonant Between Vowels

For `V C V`, break before the consonant.

Examples:

- `молоко` -> `мо-ло-ко`
- `корова` -> `ко-ро-ва`
- `переход` -> `пе-ре-ход`

### Two Consonants Between Vowels

For `V C C V`, prefer a break between the consonants.

Examples:

- `лампа` -> `лам-па`
- `гордый` -> `гор-дый`
- `письмо` -> `пись-мо`

If the cluster contains `й`, `ь`, or `ъ`, keep that letter on the left and break
after it when legal.

Examples:

- `майор` -> `май-ор`
- `подъезд` -> `подъ-езд`
- `большой` -> `боль-шой`

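
The rules for spans with at most two consonants (adjacent vowels, `V C V`,
`V C C V`, and the `й`/`ь`/`ъ` case) can be sketched together. All names are
hypothetical. The sketch only proposes candidates; the hard legal filters still
have to be applied to the result.

```python
from typing import Optional

VOWELS = frozenset("аеёиоуыэюя")
LEFT_BOUND = frozenset("йьъ")  # letters that stay with the left part


def vowel_spans(word: str) -> list[tuple[int, int, str]]:
    """(left_vowel_index, right_vowel_index, cluster) per neighboring vowel pair."""
    w = word.lower()
    nuclei = [i for i, ch in enumerate(w) if ch in VOWELS]
    return [(a, b, w[a + 1:b]) for a, b in zip(nuclei, nuclei[1:])]


def short_cluster_break(a: int, cluster: str) -> Optional[int]:
    """Preferred break index for a span with at most two consonants.

    `a` is the index of the left vowel. Returns None for longer clusters.
    """
    consonants = [ch for ch in cluster if ch not in LEFT_BOUND]
    if len(consonants) > 2:
        return None
    # й/ь/ъ in the cluster: break right after the last such letter.
    for k in range(len(cluster) - 1, -1, -1):
        if cluster[k] in LEFT_BOUND:
            return a + k + 2
    if len(cluster) == 2:
        return a + 2  # V C | C V
    return a + 1      # V | V  and  V | C V


def short_cluster_breaks(word: str) -> list[int]:
    found = []
    for a, _b, cluster in vowel_spans(word):
        index = short_cluster_break(a, cluster)
        if index is not None:
            found.append(index)
    return found


print(short_cluster_breaks("молоко"))   # [2, 4]
print(short_cluster_breaks("большой"))  # [4]
```
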
### Three Or More Consonants Between Vowels

For longer clusters, prefer the latest break that still leaves a pronounceable
left part, but keep common inseparable starts on the right:

- Keep `ст`, `ск`, `сп`, `сн`, `сл`, `см`, `св` together on the right when
  possible.
- Keep stop/liquid pairs together on the right when possible: `бр`, `бл`,
  `вр`, `вл`, `гр`, `гл`, `др`, `тр`, `кр`, `кл`, `пр`, `пл`, `фр`, `фл`.
- Otherwise prefer splitting before the last consonant in the cluster.

Examples:

- `сестра` -> `се-стра`
- `острый` should not use `о-стрый`, because the left side is too short
- `родство` -> `род-ство`
- `чувство` -> `чув-ство`
- `предложение` -> `пред-ло-же-ние`

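
The rule text leaves some room for interpretation. The sketch below takes one
possible reading, which is an assumption rather than the defined algorithm: try
a break before each listed pair from left to right, then fall back to the break
before the last consonant, and take the first candidate whose sides are long
enough and contain a vowel. It assumes the cluster holds consonants only, and
`sides_ok` checks just a subset of the hard filters. All names are hypothetical.

```python
from typing import Optional

VOWELS = frozenset("аеёиоуыэюя")
S_PAIRS = {"ст", "ск", "сп", "сн", "сл", "см", "св"}
STOP_LIQUID = {"бр", "бл", "вр", "вл", "гр", "гл", "др", "тр",
               "кр", "кл", "пр", "пл", "фр", "фл"}
INSEPARABLE = S_PAIRS | STOP_LIQUID


def sides_ok(w: str, i: int) -> bool:
    """Subset of the hard filters: side length and vowel presence."""
    left, right = w[:i], w[i:]
    return (min(len(left), len(right)) >= 2
            and any(ch in VOWELS for ch in left)
            and any(ch in VOWELS for ch in right))


def long_cluster_break(word: str, a: int, b: int) -> Optional[int]:
    """Pick a break inside the cluster between the vowels at indices a and b."""
    w = word.lower()
    cluster = w[a + 1:b]
    # A break before each inseparable pair keeps that pair on the right.
    candidates = [a + 1 + k for k in range(len(cluster) - 1)
                  if cluster[k:k + 2] in INSEPARABLE]
    candidates.append(b - 1)  # fallback: before the last consonant
    return next((i for i in candidates if sides_ok(w, i)), None)


print(long_cluster_break("сестра", 1, 5))   # 2 -> се-стра
print(long_cluster_break("родство", 1, 6))  # 3 -> род-ство
print(long_cluster_break("острый", 0, 4))   # 2 -> ос-трый, since о-стрый is too short
```
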
## Double Consonants

When two identical consonants stand between vowels, prefer splitting between
them.

Examples:

- `масса` -> `мас-са`
- `длинный` -> `длин-ный`
- `касса` -> `кас-са`

Do not force this rule when the double consonant starts a root after a prefix.
Without a dictionary, this exception is hard to detect, so an implementation may
leave such words to the general cluster logic.

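
Detecting the basic case is straightforward. The sketch below (hypothetical
function name) finds identical consonant pairs that sit directly between two
vowels and makes no attempt at the prefix/root exception.

```python
VOWELS = frozenset("аеёиоуыэюя")
CONSONANTS = frozenset("бвгджзклмнпрстфхцчшщ")


def double_consonant_breaks(word: str) -> list[int]:
    """Break indices that split a doubled consonant standing between vowels."""
    w = word.lower()
    return [
        i
        for i in range(2, len(w) - 1)
        if w[i - 1] == w[i]          # identical neighbors
        and w[i] in CONSONANTS
        and w[i - 2] in VOWELS       # vowel before the pair
        and w[i + 1] in VOWELS       # vowel after the pair
    ]


print(double_consonant_breaks("масса"))    # [3]
print(double_consonant_breaks("длинный"))  # [4]
```
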
## Prefix-Like Boundaries

Without a morphology dictionary, treat these as preferred heuristics only.

If a word starts with a common prefix and the following part is legal, prefer a
break after the prefix:

- `без`, `бес`, `воз`, `вос`, `вз`, `вс`, `из`, `ис`, `низ`, `нис`, `раз`,
  `рас`, `роз`, `рос`, `от`, `об`, `объ`, `под`, `подъ`, `пред`, `пере`,
  `при`, `про`, `над`, `сверх`, `меж`

Examples:

- `подбить` -> `под-бить`
- `размах` -> `раз-мах`
- `предложение` -> `пред-ло-же-ние`
- `подъезд` -> `подъ-езд`

Do not create a right side starting with `ы`; prefer a neighboring legal break
instead.

Examples:

- `разыскать` -> `ра-зыскать` or `разыс-кать`, not `раз-ыскать`
- `розыгрыш` -> `ро-зыгрыш` or `розыг-рыш`, not `роз-ыгрыш`

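
A prefix check along these lines might look like the sketch below. Trying longer
prefixes first (so `подъ` wins over `под`) is an assumption, as are the function
names; the legality check repeats the hard filters so the block runs on its own.

```python
from typing import Optional

VOWELS = frozenset("аеёиоуыэюя")
CONSONANTS = frozenset("бвгджзклмнпрстфхцчшщ")
PREFIXES = (
    "без", "бес", "воз", "вос", "вз", "вс", "из", "ис", "низ", "нис", "раз",
    "рас", "роз", "рос", "от", "об", "объ", "под", "подъ", "пред", "пере",
    "при", "про", "над", "сверх", "меж",
)


def is_legal_break(w: str, i: int) -> bool:
    """Hard legal filters for splitting w[:i] | w[i:] (w already lowercase)."""
    left, right = w[:i], w[i:]
    if len(left) < 2 or len(right) < 2:
        return False
    if not (VOWELS & set(left)) or not (VOWELS & set(right)):
        return False
    if right[0] in "ьъйы":
        return False
    return not (left[-1] in CONSONANTS and right[0] in VOWELS)


def prefix_break(word: str) -> Optional[int]:
    """Index of a legal break after a listed prefix, or None."""
    w = word.lower()
    # Assumption: longest prefix first, so подъ is tried before под.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if w.startswith(prefix) and is_legal_break(w, len(prefix)):
            return len(prefix)
    return None


print(prefix_break("подбить"))    # 3 -> под-бить
print(prefix_break("подъезд"))    # 4 -> подъ-езд
print(prefix_break("разыскать"))  # None, because раз-ыскать is illegal
```

Vowel-less prefixes such as `вз` and `вс` never yield a break under the hard
filters, since the left side would contain no vowel.
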
## Ranking

A word may have several legal break points. The plugin should insert all good
breaks, but it should avoid noisy low-quality breaks. Use this ranking:

1. Prefix boundary, if legal.
2. Double consonant split between vowels.
3. Syllable breaks from adjacent vowel and consonant-cluster rules.
4. Longer-cluster fallback break before the last consonant.

Reject candidates that are legal but awkward when a better candidate is within
one character and both candidates divide the same vowel-to-vowel span.

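
The pruning rule can be sketched as follows. `Rank`, `Candidate`, and `prune` are
hypothetical names, and the sketch reads "better" as "has a smaller rank number
in the list above", which is an assumption. The candidate list for `подстава` is
hand-built for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rank(IntEnum):
    PREFIX = 1
    DOUBLE_CONSONANT = 2
    SYLLABLE = 3
    FALLBACK = 4


@dataclass(frozen=True)
class Candidate:
    index: int  # break position in the word
    rank: Rank  # which rule proposed it
    span: int   # ordinal of the vowel-to-vowel span the break divides


def prune(candidates: list[Candidate]) -> list[int]:
    """Drop a candidate when a better-ranked one lies within one character
    in the same span; return the surviving break indices in order."""
    kept = [
        c for c in candidates
        if not any(
            o.span == c.span and o.rank < c.rank and abs(o.index - c.index) <= 1
            for o in candidates
        )
    ]
    return sorted({c.index for c in kept})


# Illustrative candidates for "подстава": под|става, подс|тава, подста|ва.
candidates = [
    Candidate(3, Rank.PREFIX, 0),
    Candidate(3, Rank.SYLLABLE, 0),
    Candidate(4, Rank.FALLBACK, 0),
    Candidate(6, Rank.SYLLABLE, 1),
]
print(prune(candidates))  # [3, 6] -> под-ста-ва
```
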
## Example Expectations

These strings use `-` where the implementation will insert `SoftHyphen`.

```text
молоко -> мо-ло-ко
корова -> ко-ро-ва
яблоко -> яб-ло-ко
повествование -> по-вест-во-ва-ние
предложение -> пред-ло-же-ние
компьютер -> компью-тер
подъезд -> подъ-езд
большой -> боль-шой
майор -> май-ор
масса -> мас-са
длинный -> длин-ный
разыскать -> ра-зыс-кать
розыгрыш -> ро-зыг-рыш
```

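
The table above can be turned into test cases with a small parser. This is a
sketch with hypothetical names; `check` accepts whatever hyphenation function the
plugin ends up exposing.

```python
from typing import Callable

SOFT_HYPHEN = "\u00ad"  # U+00AD; the constant name is an assumption

EXPECTATIONS = """
молоко -> мо-ло-ко
корова -> ко-ро-ва
яблоко -> яб-ло-ко
повествование -> по-вест-во-ва-ние
предложение -> пред-ло-же-ние
компьютер -> компью-тер
подъезд -> подъ-езд
большой -> боль-шой
майор -> май-ор
масса -> мас-са
длинный -> длин-ный
разыскать -> ра-зыс-кать
розыгрыш -> ро-зыг-рыш
"""


def parse_expectations(text: str) -> list[tuple[str, str]]:
    """Return (word, expected_with_soft_hyphens) pairs from `a -> b` lines."""
    cases = []
    for line in text.strip().splitlines():
        word, expected = (part.strip() for part in line.split("->"))
        if expected.replace("-", "") != word:
            raise ValueError(f"expectation does not match its word: {line!r}")
        cases.append((word, expected.replace("-", SOFT_HYPHEN)))
    return cases


def check(hyphenate: Callable[[str], str]) -> list[str]:
    """Return the words for which `hyphenate` disagrees with the table."""
    return [w for w, expected in parse_expectations(EXPECTATIONS)
            if hyphenate(w) != expected]


print(len(parse_expectations(EXPECTATIONS)))  # 13
```
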
## Non-Goals

- Full dictionary-level hyphenation.
- Stress-aware syllabification.
- Exact morpheme detection for every prefix/root boundary.
- Hyphenating proper abbreviations and mixed-script technical identifiers.