Added Sinhala test cases for spaCy, implemented si_tokenizer and si_vocab, and fixed version issues in the requirements file. #13970

Open
AshenWELI wants to merge 1 commit into explosion:master from AshenWELI:test_si

AshenWELI commented May 20, 2026


title: Adding Sinhala (si) language test cases

Description

Previously, spaCy had no test coverage for the Sinhala (si) language.
This PR adds a comprehensive test suite for the Sinhala tokenizer and lexical
attributes, covering real-world text sourced from BBC Sinhala news articles.

Changes introduced

New files:
  • spacy/tests/lang/si/test_text.py
  • spacy/tests/lang/si/test_tokenizer.py
  • spacy/tests/lang/si/__init__.py

Edited file:
  • spacy/tests/conftest.py

The following test cases were added:

  • test_si_tokenizer_handles_long_text — validates correct tokenization
    of a long real-world Sinhala news paragraph (83 tokens), sourced from
    BBC Sinhala.

  • test_si_tokenizer_handles_cnts — parametrized tests covering a variety
    of Sinhala sentence structures including punctuation, numbers, question marks,
    semicolons, and mixed numeric-Sinhala tokens (e.g. 110,000කට).

  • test_lex_attrs_like_number — validates the like_num lexical attribute
    for Sinhala numeric tokens including digits, fractions, comma-formatted numbers,
    Sinhala cardinal number words (e.g. එක, දෙක), and large number words
    (e.g. බිලියනය).

  • test_si_lex_attrs_like_number_for_ordinal — validates like_num for
    Sinhala ordinal forms including suffix-attached ordinals (තුන්වන, සියවන)
    and space-separated ordinals (100 වන, 23 වෙනි).

  • test_si_lex_attrs_capitals — confirms that like_num handles Sinhala
    words correctly regardless of .upper() calls, since Sinhala script has no
    concept of upper or lower case.

  • test_si_tokenizer_handles_exception_cases — validates tokenizer
    exception handling for punctuation splitting, ordinal numbers (10වෙනි
    staying as one token), and conjunct character sequences (ශ්‍රී, ත්‍රස්තවාදීන්).
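
To illustrate why the conjunct cases above need explicit coverage, here is a small standard-library sketch (not code from this PR). It assumes the usual encoding of ශ්‍රී, in which an invisible ZERO WIDTH JOINER sits between the consonants; the helper name `describe` is hypothetical.

```python
import unicodedata

# The conjunct written out by code point so the invisible joiner is explicit.
# Assumption: this is the usual encoding of the word used in the test cases.
SRI = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3"


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code, name in describe(SRI):
    print(code, name)

# One visible cluster, several code points, one of them a joiner.
print(len(SRI), "\u200d" in SRI)

# The joiner is not whitespace, so a plain whitespace split keeps the cluster whole.
print(SRI.split() == [SRI])
```

A tokenizer rule that treats the joiner or the combining signs as a boundary would cut such a cluster apart, which is the regression the exception-case test is meant to catch.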

Notes

  • Sinhala has no upper or lower case — .lower() and .upper() calls have no
    effect on Sinhala script, which is reflected in the test design.
  • All test sentences were sourced from authentic Sinhala news media to ensure
    real-world coverage.
  • The like_num function in lex_attrs.py was extended to support Sinhala
    ordinal suffixes (වන, වෙනි) and additional cardinal number words.

Types of change

New feature - adds missing test coverage for the Sinhala (si) language module.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.
