Conversation
ivansmokovic
left a comment
LGTM.
I have some questions.
Not sure about Specials subclassing str. What is the main benefit of this compared to a simple Special class?
class VocabDict(dict):
class Special(str):
Is the user expected to implement own Specials at some point or are all Specials maintained by us?
It is intended for the user to subclass ``Special`` for new use cases (e.g., Mask is a relatively new case).
2. Adds a stub ``apply`` method which accepts a sequence of tokens and adds the special token to that sequence. In essence, the apply method is a post-tokenization hook which doesn't see the raw data, and whose job is to add the special token to the sequence or replace some of the existing tokens with the special token. The special tokens are applied after all post-tokenization hooks, in the order they are passed to the :class:`podium.storage.vocab.Vocab` constructor. Each concrete implementation of a Special token has to implement this method.
3. Implements singleton-like hash and equality checks. The ``Special`` class overrides the default hash and equality and, instead of checking for string value equality, checks for *class name equality*. We use this type of check to ensure that each Vocab has a single instance of each Special, and for simpler referencing and contains checks.
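The behavior described in points 2 and 3 above could be sketched roughly as follows. This is a hypothetical illustration of the design, not the actual Podium source; the ``default_value`` attribute and exact method bodies are assumptions.

```python
# Hypothetical sketch of the Special design described above; the real
# Podium class may differ in details (names here are assumptions).
class Special(str):
    """A special token: a str subclass compared by class, not by value."""

    default_value = None

    def __new__(cls, token=None):
        # Fall back to the subclass's default string value.
        if token is None:
            token = cls.default_value
        return super().__new__(cls, token)

    def __hash__(self):
        # Singleton-like: hash by class name, so a Vocab can hold at
        # most one instance of each Special subclass.
        return hash(self.__class__.__name__)

    def __eq__(self, other):
        # Class-name equality instead of string-value equality.
        return self.__class__ == other.__class__

    def apply(self, sequence):
        # Post-tokenization hook stub; concrete specials override this
        # to add the token to (or replace tokens in) the sequence.
        raise NotImplementedError
```

Subclassing ``str`` means a Special can be stored and indexed like any other token string, while the overridden hash and equality give the singleton-like behavior point 3 describes.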
To better understand how specials work, we will walk through the implementation of one of the special tokens implemented in Podium: the beginning-of-sequence (BOS) token.
Do you maybe happen to know a resource listing the typical Specials used in NLP that we could link here? After a quick Google search I could not find one.
Vocabs in transformers (or tokenizers? not sure where they delegated the vocab) had quite a large number of reserved tokens.
Yes, but this is the best I could find: https://huggingface.co/transformers/main_classes/tokenizer.html#pretrainedtokenizer
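The BOS walkthrough mentioned in the quoted docs could look roughly like the sketch below. A minimal ``Special`` base is repeated here so the example is self-contained; the class names and the ``"<BOS>"`` value are illustrative assumptions, not the exact Podium source.

```python
# Illustrative BOS special following the pattern discussed in the thread.
class Special(str):
    """Minimal stand-in for the Special base class (assumed API)."""

    default_value = None

    def __new__(cls, token=None):
        return super().__new__(cls, token if token is not None else cls.default_value)

    def __hash__(self):
        return hash(self.__class__.__name__)

    def __eq__(self, other):
        return self.__class__ == other.__class__

    def apply(self, sequence):
        raise NotImplementedError


class BOS(Special):
    """Beginning-of-sequence token, applied as a post-tokenization hook."""

    default_value = "<BOS>"

    def apply(self, sequence):
        # Prepend the BOS token to the already-tokenized sequence.
        return [self] + list(sequence)
```

Note that ``apply`` only sees the token sequence, never the raw data, which matches the post-tokenization-hook framing above.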
docs/source/specials.rst
Outdated
@@ -0,0 +1,29 @@
Special tokens
===============
.. autoclass:: podium.storage.vocab.Special
This should (eventually) get refactored to omit ``storage``, but looks good otherwise.
Yes, I don't know why the docs refuse to use the shortened versions. I have an idea and might try it out soon.
@FilipBolt all comments addressed
@mttk Can you please reference the issue that this PR will close?
FilipBolt
left a comment
Looks good to me; a few minor comments left, but good to go in my mind.
podium/vocab.py
Outdated
        [self.stoi[token] if token in self.stoi else unk_token for token in data]
    )
else:
    # Either UNK is not in Vocab or the user has requested unknown tokens
Makes sense, we might need to update the docs to reflect what happens when UNK is in/out of the vocab.
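The two branches being discussed could be documented with a simplified sketch like the one below. It assumes a plain token-to-index dict (``stoi``) and a drop-unknown-tokens policy for the no-UNK case; both names and the fallback policy are assumptions for illustration, not the actual Podium implementation.

```python
# Simplified sketch of numericalization with and without an UNK special,
# assuming a plain token-to-index mapping `stoi` (illustrative only).
def numericalize(tokens, stoi, unk_index=None):
    if unk_index is not None:
        # UNK is in the Vocab: map out-of-vocabulary tokens to its index.
        return [stoi.get(token, unk_index) for token in tokens]
    # No UNK special: here unknown tokens are silently dropped, though
    # raising an error would be an equally reasonable policy to document.
    return [stoi[token] for token in tokens if token in stoi]
```

Spelling out which of these behaviors applies when UNK is in or out of the vocab is exactly the docs update suggested above.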
Some outstanding things:
- Add masking functionality in wrapper of vocab -> prepped for this, but left for next PR (``deterministic`` arg)

Closes #216