Conversation
ivansmokovic
left a comment
LGTM.
I have some questions.
Not sure about Specials subclassing str. What is the main benefit of this compared to a simple Special class?
class VocabDict(dict):
class Special(str):
Is the user expected to implement own Specials at some point or are all Specials maintained by us?
It is intended for the user to subclass ``Special`` for new use cases (e.g., Mask is a relatively new case).
2. Adds a stub ``apply`` method which accepts a sequence of tokens and adds the special token to that sequence. In essence, the apply method is a post-tokenization hook which doesn't see the raw data, and whose job is to add the special token to the sequence or replace some of the existing tokens with the special token. The special tokens are applied after all post-tokenization hooks, in the order they are passed to the :class:`podium.storage.vocab.Vocab` constructor. Each concrete implementation of a Special token has to implement this method.
3. Implements singleton-like hash and equality checks. The ``Special`` class overrides the default hash and equality and, instead of checking for string value equality, checks for *class name equality*. We use this type of check to ensure that each Vocab has a single instance of each Special, and for simpler referencing and contains checks.
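The behavior described in points 2 and 3 above could be sketched roughly as follows. This is a hypothetical illustration of the design, not the actual Podium source; the ``default_value`` attribute and exact method bodies are assumptions.

```python
# Hypothetical sketch of the Special design described above; the real
# Podium class may differ in details (names here are assumptions).
class Special(str):
    """A special token: a str subclass compared by class, not by value."""

    default_value = None

    def __new__(cls, token=None):
        # Fall back to the subclass's default string value.
        if token is None:
            token = cls.default_value
        return super().__new__(cls, token)

    def __hash__(self):
        # Singleton-like: hash by class name, so a Vocab can hold at
        # most one instance of each Special subclass.
        return hash(self.__class__.__name__)

    def __eq__(self, other):
        # Class-name equality instead of string-value equality.
        return self.__class__ == other.__class__

    def apply(self, sequence):
        # Post-tokenization hook stub; concrete specials override this
        # to add the token to (or replace tokens in) the sequence.
        raise NotImplementedError
```

Subclassing ``str`` means a Special can be stored and indexed like any other token string, while the overridden hash and equality give the singleton-like behavior point 3 describes.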
To better understand how specials work, we will walk through the implementation of one of the special tokens implemented in Podium: the beginning-of-sequence (BOS) token.
Do you maybe happen to know a resource listing the typical Specials used in NLP that we could link here? After a quick Google search I could not find one.
Vocabs in transformers (or tokenizers? not sure where they delegated the vocab) had quite a large number of reserved tokens.
Yes, but this is the best I could find: https://huggingface.co/transformers/main_classes/tokenizer.html#pretrainedtokenizer
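The BOS walkthrough mentioned in the quoted docs could look roughly like the sketch below. A minimal ``Special`` base is repeated here so the example is self-contained; the class names and the ``"<BOS>"`` value are illustrative assumptions, not the exact Podium source.

```python
# Illustrative BOS special following the pattern discussed in the thread.
class Special(str):
    """Minimal stand-in for the Special base class (assumed API)."""

    default_value = None

    def __new__(cls, token=None):
        return super().__new__(cls, token if token is not None else cls.default_value)

    def __hash__(self):
        return hash(self.__class__.__name__)

    def __eq__(self, other):
        return self.__class__ == other.__class__

    def apply(self, sequence):
        raise NotImplementedError


class BOS(Special):
    """Beginning-of-sequence token, applied as a post-tokenization hook."""

    default_value = "<BOS>"

    def apply(self, sequence):
        # Prepend the BOS token to the already-tokenized sequence.
        return [self] + list(sequence)
```

Note that ``apply`` only sees the token sequence, never the raw data, which matches the post-tokenization-hook framing above.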
docs/source/specials.rst
Outdated
@@ -0,0 +1,29 @@
Special tokens
===============
.. autoclass:: podium.storage.vocab.Special
This should (eventually) get refactored to omit ``storage``, but looks good otherwise.
Yes, I don't know why the docs refuse to use the shortened versions. I have an idea and might try it out soon.
@FilipBolt all comments addressed
@mttk Can you please reference the issue that this PR will close?
FilipBolt
left a comment
Looks good to me; a few minor comments left, but good to go in my mind.
podium/vocab.py
Outdated
        [self.stoi[token] if token in self.stoi else unk_token for token in data]
    )
else:
    # Either UNK is not in Vocab or the user has requested unknown tokens
Makes sense, we might need to update the docs to reflect what happens when UNK is in/out of the vocab.
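The two branches being discussed could be documented with a simplified sketch like the one below. It assumes a plain token-to-index dict (``stoi``) and a drop-unknown-tokens policy for the no-UNK case; both names and the fallback policy are assumptions for illustration, not the actual Podium implementation.

```python
# Simplified sketch of numericalization with and without an UNK special,
# assuming a plain token-to-index mapping `stoi` (illustrative only).
def numericalize(tokens, stoi, unk_index=None):
    if unk_index is not None:
        # UNK is in the Vocab: map out-of-vocabulary tokens to its index.
        return [stoi.get(token, unk_index) for token in tokens]
    # No UNK special: here unknown tokens are silently dropped, though
    # raising an error would be an equally reasonable policy to document.
    return [stoi[token] for token in tokens if token in stoi]
```

Spelling out which of these behaviors applies when UNK is in or out of the vocab is exactly the docs update suggested above.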
Some outstanding things:
- Add masking functionality in wrapper of vocab -> prepped for this, but left for next PR (``deterministic`` arg)

Closes #216