Added Sinhala test cases for spaCy, implemented si_tokenizer and si_vocab, and fixed version issues in the requirements file. #13970
Open
AshenWELI wants to merge 1 commit into
title: Adding Sinhala (si) language test cases
Description
Previously, spaCy had no test coverage for the Sinhala (`si`) language. This PR adds a comprehensive test suite for the Sinhala tokenizer and lexical attributes, covering real-world text sourced from BBC Sinhala news articles.
Changes introduced
New files:
- `spacy/tests/lang/si/test_text.py`
- `spacy/tests/lang/si/test_tokenizer.py`
- `spacy/tests/lang/si/__init__.py`
Edited file:
- `spacy/tests/conftest.py`
The following test cases were added:
- `test_si_tokenizer_handles_long_text`: validates correct tokenization of a long real-world Sinhala news paragraph (83 tokens), sourced from BBC Sinhala.
- `test_si_tokenizer_handles_cnts`: parametrized tests covering a variety of Sinhala sentence structures, including punctuation, numbers, question marks, semicolons, and mixed numeric-Sinhala tokens (e.g. `110,000කට`).
- `test_lex_attrs_like_number`: validates the `like_num` lexical attribute for Sinhala numeric tokens, including digits, fractions, comma-formatted numbers, Sinhala cardinal number words (e.g. `එක`, `දෙක`), and large number words (e.g. `බිලියනය`).
- `test_si_lex_attrs_like_number_for_ordinal`: validates `like_num` for Sinhala ordinal forms, including suffix-attached ordinals (`තුන්වන`, `සියවන`) and space-separated ordinals (`100 වන`, `23 වෙනි`).
- `test_si_lex_attrs_capitals`: confirms that `like_num` handles Sinhala words correctly regardless of `.upper()` calls, since Sinhala script has no concept of upper or lower case.
- `test_si_tokenizer_handles_exception_cases`: validates tokenizer exception handling for punctuation splitting, ordinal numbers (`10වෙනි` staying as one token), and conjunct character sequences (`ශ්රී`, `ත්රස්තවාදීන්`).
Notes
- `.lower()` and `.upper()` calls have no effect on Sinhala script, which is reflected in the test design.
- real-world coverage.
- The `like_num` function in `lex_attrs.py` was extended to support Sinhala ordinal suffixes (`වන`, `වෙනි`) and additional cardinal number words.
Types of change
New feature - adds missing test coverage for the Sinhala (`si`) language module.
Checklist
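For reviewers who want a quick picture of the `like_num` behaviour described in the notes above, here is a minimal, self-contained sketch. It is not the code in `lex_attrs.py`: the names `CARDINAL_WORDS`, `ORDINAL_WORDS`, `ORDINAL_SUFFIXES` and `_is_plain_number` are hypothetical, and the word lists contain only the examples quoted in this description. It assumes ordinal suffixes are attached directly to a numeric stem, as in `10වෙනි`.

```python
# Hypothetical sketch only; not spaCy's actual Sinhala lex_attrs.py.
# Word lists hold just the examples quoted in this PR description.
CARDINAL_WORDS = {"එක", "දෙක", "බිලියනය"}
ORDINAL_WORDS = {"තුන්වන", "සියවන"}
ORDINAL_SUFFIXES = ("වන", "වෙනි")


def _is_plain_number(text: str) -> bool:
    """True for digits, comma-formatted numbers and simple fractions."""
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        return num.isdigit() and denom.isdigit()
    return False


def like_num(text: str) -> bool:
    """Return True if the token looks like a number or an ordinal."""
    # Drop a single leading sign-like character before checking.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    if _is_plain_number(text):
        return True
    if text in CARDINAL_WORDS or text in ORDINAL_WORDS:
        return True
    # Suffix-attached ordinals: a numeric stem followed by a known suffix.
    for suffix in ORDINAL_SUFFIXES:
        if text.endswith(suffix) and _is_plain_number(text[: -len(suffix)]):
            return True
    return False
```

Because Sinhala script has no case mappings, `like_num("තුන්වන".upper())` gives the same result as `like_num("තුන්වන")` in this sketch, which is the property `test_si_lex_attrs_capitals` is described as checking.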