
Device-side assertion failure when training on a cuda device with added tokens in the tokenizer #53

Open
LunaRaeW opened this issue Jan 25, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@LunaRaeW

Bug description

If cuda is used, the transformer model fails a device-side assertion when additional special tokens have been added to the tokenizer. This does not happen when device='cpu' or device='mps' is specified, suggesting that this might be a PyTorch issue. However, a workaround cannot be implemented through the small-text API and requires modifying its source code.

Steps to reproduce

Using small-text/tree/main/examples/examplecode/transformers_multiclass_classification.py as an example:

Change
tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
to

tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

This causes a device-side assertion failure when cuda is used:

clf_factory = TransformerBasedClassificationFactory(TRANSFORMER_MODEL,
                                                        num_classes,
                                                        kwargs=dict({
                                                            'device': 'cuda'
                                                        }))

The failure is due to an embedding size mismatch: the enlarged tokenizer produces token ids that lie outside the model's original embedding matrix.
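
For context, a minimal PyTorch-only sketch (not using small-text; the vocabulary and hidden sizes below are illustrative) of how an out-of-range token id surfaces as a device-side assertion on CUDA:

import torch
import torch.nn as nn

# Original embedding table, e.g. 30522 rows for bert-base-uncased, 768-dim vectors.
vocab_size = 30522
embedding = nn.Embedding(vocab_size, 768).to('cuda')

# After adding two special tokens without resizing the embeddings, the tokenizer
# emits ids 30522 and 30523, which lie past the end of the table.
token_ids = torch.tensor([[101, 30522, 102]], device='cuda')

out = embedding(token_ids)  # fails with "CUDA error: device-side assert triggered"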

Expected behavior

The model should automatically resize its token embeddings to match the new vocabulary size.
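
For comparison, the usual pattern in plain transformers (outside small-text; the model name and num_labels below are placeholders) is to resize the embeddings right after adding the tokens:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=6)
# Grow the input embedding matrix to cover the newly added token ids.
model.resize_token_embeddings(len(tokenizer))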

Workaround:

In small_text/integrations/transformers/utils/classification.py, in the function _initialize_transformer_components, change the following

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )

to

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )
    model.resize_token_embeddings(new_num_tokens=NEW_VOCAB_SIZE)

i.e., adding the final line. The new vocabulary size has to be hard-coded because the customised tokenizer is not accessible in this function. If the tokenizer were accessible, the final line could simply be model.resize_token_embeddings(new_num_tokens=len(tokenizer)).
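
One way to determine the value to hard-code as NEW_VOCAB_SIZE is to build the same tokenizer (with the added special tokens) once and read its length; the model name below is a placeholder for TRANSFORMER_MODEL.model:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', cache_dir='.cache/')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

NEW_VOCAB_SIZE = len(tokenizer)  # e.g. 30524 for bert-base-uncased plus two added tokens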

Environment:

Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8

@LunaRaeW added the bug label on Jan 25, 2024
@chschroeder
Contributor

Thanks for reporting this! I will look into it.

@chschroeder
Contributor

@RaymondUoE With just the additional tokenizer.add_special_tokens() call, I cannot reproduce the error. Can you provide details on the assertion output?
