@RaymondUoE With just the additional tokenizer.add_special_tokens() call, I cannot reproduce the error. Can you provide details on the assertion output?
Bug description
When using cuda, the transformer model fails a device-side assertion if additional special tokens have been added to the tokenizer. This does not happen when device='cpu' or device='mps' is specified, suggesting that this might be a PyTorch issue. However, the workaround cannot be applied through the small-text API and requires modifying its source code.
Steps to reproduce
Using small-text/tree/main/examples/examplecode/transformers_multiclass_classification.py as an example, change
tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
so that additional special tokens are registered via tokenizer.add_special_tokens() after the tokenizer is loaded.
This will cause the device-side assertion to fail when using cuda, due to an embedding size mismatch.
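The mechanism behind the mismatch can be illustrated without CUDA or transformers. The stub classes below are hypothetical stand-ins, not the real small-text or Hugging Face API: adding special tokens grows the tokenizer's vocabulary, but the model's embedding table keeps the row count it had at load time, so the new token ids index past the end of the table (on CUDA this surfaces as a device-side assert rather than a Python exception):

```python
# Hypothetical stand-ins for a Hugging Face tokenizer and a model's
# token embedding table; not the real transformers API.
class StubTokenizer:
    def __init__(self, vocab):
        self.vocab = list(vocab)

    def add_special_tokens(self, special_tokens):
        # New tokens are appended, so they receive the highest ids.
        self.vocab.extend(special_tokens)

    def __len__(self):
        return len(self.vocab)

    def encode(self, tokens):
        return [self.vocab.index(t) for t in tokens]


class StubEmbedding:
    def __init__(self, num_rows):
        # One weight row per token id known at model-load time.
        self.rows = [[0.0] for _ in range(num_rows)]

    def lookup(self, ids):
        for i in ids:
            if not 0 <= i < len(self.rows):
                # CPU analogue of CUDA's "device-side assert triggered"
                # for an out-of-range index into the embedding table.
                raise IndexError(f"token id {i} >= embedding size {len(self.rows)}")
        return [self.rows[i] for i in ids]


tokenizer = StubTokenizer(['[PAD]', 'hello', 'world'])
model_embedding = StubEmbedding(num_rows=len(tokenizer))  # sized at load time

tokenizer.add_special_tokens(['[NEW]'])  # vocab grows to 4; embedding stays at 3
ids = tokenizer.encode(['hello', '[NEW]'])
try:
    model_embedding.lookup(ids)
except IndexError as e:
    print(e)  # token id 3 >= embedding size 3
```

On CUDA the out-of-range lookup is only detected on the device, which is why it appears as an opaque device-side assertion instead of an IndexError.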
Expected behavior
The model resizes its token embeddings to the new vocabulary size automatically.
Workaround:
In the file small_text/integrations/transformers/utils/classification.py, function _initialize_transformer_components, extend the model initialization with a final line that resizes the token embeddings. This requires the new vocab size to be hard-coded, because the customised tokenizer is inaccessible in this function. If the tokenizer were accessible, the final line could simply be
model.resize_token_embeddings(new_num_tokens=len(tokenizer))
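The effect of the workaround can be sketched with a hypothetical stand-in for the model (for a real transformers model, model.resize_token_embeddings(new_num_tokens=...) does this work). The StubModel class and the hard-coded size of 4 below are illustrative assumptions, not small-text code:

```python
# Hypothetical stand-in for a transformers model; only the resize
# behaviour relevant to the workaround is modelled here.
class StubModel:
    def __init__(self, vocab_size):
        self.embedding_rows = [[0.0] for _ in range(vocab_size)]

    def resize_token_embeddings(self, new_num_tokens):
        # Mirrors the spirit of transformers' resize: append freshly
        # initialised rows (or truncate) so the embedding table matches
        # the tokenizer's vocabulary size.
        current = len(self.embedding_rows)
        if new_num_tokens > current:
            self.embedding_rows.extend([0.0] for _ in range(new_num_tokens - current))
            self.embedding_rows[current:] = [[0.0] for _ in range(new_num_tokens - current)]
        else:
            del self.embedding_rows[new_num_tokens:]

    def lookup(self, ids):
        return [self.embedding_rows[i] for i in ids]


model = StubModel(vocab_size=3)    # sized for the original vocabulary
new_vocab_size = 4                 # hard-coded, as the workaround requires
model.resize_token_embeddings(new_num_tokens=new_vocab_size)
print(len(model.embedding_rows))   # 4 -- id 3 can now be looked up safely
```

With the tokenizer in scope, new_vocab_size would instead be len(tokenizer), avoiding the hard-coded constant.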
Environment:
Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8