Key terms updates necessary for use in SILNLP #257
Also, update machine.py library version
Codecov Report ❌
@@            Coverage Diff             @@
##             main     #257      +/-   ##
==========================================
- Coverage   90.74%   90.64%   -0.11%
==========================================
  Files         352      354       +2
  Lines       22337    22485     +148
==========================================
+ Hits        20270    20381     +111
- Misses       2067     2104      +37
ddaspit left a comment
@ddaspit partially reviewed 11 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93).
machine/corpora/key_term_row.py line 0 at r1 (raw file):
This file should be named key_term.py.
Enkidu93 left a comment
@Enkidu93 made 3 comments.
Reviewable status: 8 of 21 files reviewed, 1 unresolved discussion (waiting on @ddaspit).
machine/corpora/key_term_row.py line at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This file should be named key_term.py.
Done.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 362 at r3 (raw file):
    ).tokens()
    src_term_partial_word_tokens.remove("▁")
    src_term_partial_word_tokens.remove("\ufffc")
This mirrors the code in silnlp more or less exactly. I opened an issue for creating a shared utility function that could handle some of this. I also experimented with finding a safe way to do this with non-fast tokenizers. It's something we should look into as needed, but I decided that it was taking too much time.
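As a rough illustration of what such a shared utility might look like, here is a minimal sketch. The helper name `strip_marker_tokens` and its signature are hypothetical and are not taken from machine.py or silnlp; it only assumes, as in the snippet above, that "▁" and "\ufffc" appear as stand-alone tokens that need to be dropped.

```python
from typing import Iterable, List


# Hypothetical helper: the name and signature are illustrative only and are
# not part of machine.py or silnlp.
def strip_marker_tokens(
    tokens: Iterable[str], markers: Iterable[str] = ("▁", "\ufffc")
) -> List[str]:
    """Return a new list with every stand-alone marker token dropped."""
    marker_set = set(markers)
    return [token for token in tokens if token not in marker_set]


# Made-up token list standing in for the output of a fast tokenizer.
tokens = ["▁", "\ufffc", "tele", "phone"]
cleaned = strip_marker_tokens(tokens)
```

Unlike `list.remove`, which deletes only the first match and raises `ValueError` when the token is absent, this sketch drops every occurrence and tolerates inputs that contain no markers. Whether that is the desired behavior here is a design choice for the shared utility.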
tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 130 at r3 (raw file):
    corpus = source_corpus.align_rows(target_corpus)
    terms_corpus = DictionaryTextCorpus(MemoryText("terms", [TextRow("terms", 1, ["telephone"])])).align_rows(
I don't love that this test doesn't really cover whether the terms affect the result. I just added it for code coverage (no exceptions thrown, etc.), but I couldn't adapt our one true fine-tuning test because it uses a non-fast tokenizer. I looked for alternatives but couldn't find anything that works. I did confirm in the debugger that everything was being tokenized properly. Maybe we should consider outputting some kind of artifact in ClearML (?) with the tokenized data so that we can compare it apples-to-apples with the tokenized experiment txt files in silnlp.
Added support for capturing renderings patterns, references, and term domains. Moved to using a KeyTerm data structure rather than tuples. (This also includes porting of recent changes in Machine: sillsdev/machine#362 and sillsdev/machine#368.)
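To illustrate the move from tuples to a named structure, here is a minimal sketch. The class name `KeyTermSketch` and all of its field names are assumptions made for illustration; the actual `KeyTerm` definition in this PR may differ.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-in for the KeyTerm structure; field names are assumed,
# not copied from machine.py.
@dataclass(frozen=True)
class KeyTermSketch:
    id: str
    domain: str = ""
    renderings: List[str] = field(default_factory=list)
    renderings_patterns: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


# Tuple form: each position has to be remembered by the caller.
term_as_tuple = ("telephone", ["telephone"])

# Named form: new attributes can be added without breaking positional unpacking.
term = KeyTermSketch(id="telephone", renderings=["telephone"])
```

With named fields, callers read `term.renderings` rather than `term_as_tuple[1]`, and attributes such as references or domains can be added later without touching existing call sites.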