Transformers Library for Generative AI — The Basics

--

The transformers library, developed by Hugging Face, is a powerful tool for natural language processing (NLP) tasks, built around pre-trained models such as BERT, GPT, and RoBERTa.

It provides easy access to a wide range of pre-trained models and allows fine-tuning on custom datasets for specific NLP tasks.

In this tutorial, I’ll walk you through the basics of using the transformers library in Python with code examples.

Installation

First, you need to install the transformers library along with PyTorch, which the examples below use as the backend. You can do this via pip:

pip install transformers torch

Basic Usage

Let’s start with a simple example of loading a pre-trained model and performing text classification.

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Tokenize input text
text = "This is a sample sentence for classification."
inputs = tokenizer(text, return_tensors="pt")

# Perform classification (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Get predicted label
predicted_label = torch.argmax(logits, dim=1).item()
print("Predicted label:", predicted_label)

In this example, we’re using the bert-base-uncased model for text classification. We first load the pre-trained tokenizer and model, then tokenize the input text and convert it into PyTorch tensors. Passing the tokenized input through the model yields logits, one score per class, and taking the argmax of the logits gives the predicted label. Note that the classification head of bert-base-uncased is randomly initialized until the model is fine-tuned, so the predicted label here isn’t meaningful on its own; it simply demonstrates the workflow.
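If you want class probabilities rather than raw logits, you can apply a softmax over the class dimension. A minimal sketch, continuing from the code above:

import torch.nn.functional as F

# Convert raw logits into probabilities over the classes
probabilities = F.softmax(logits, dim=1)
print("Class probabilities:", probabilities)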

Fine-Tuning on Custom Dataset

Next, let’s see how to fine-tune a pre-trained model on a custom dataset for text classification.

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader

# Sample dataset
class CustomDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length, return_tensors='pt')
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Prepare data
texts = [...] # List of input texts
labels = [...] # List of corresponding labels

train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)

train_dataset = CustomDataset(train_texts, train_labels, tokenizer, max_length=128)
val_dataset = CustomDataset(val_texts, val_labels, tokenizer, max_length=128)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy='epoch',
    save_strategy='epoch'
)

# Fine-tune the model
num_labels = 2  # set this to the number of classes in your dataset
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()

In this example, we define a custom dataset class to prepare the data for fine-tuning. We split the data into training and validation sets, and the dataset tokenizes each text on the fly in __getitem__ using the pre-trained tokenizer.
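To make the expected output concrete, here is a minimal sketch that indexes into the dataset with two made-up examples (the texts and labels below are purely illustrative):

# Hypothetical toy data, just to show the shapes CustomDataset returns
sample_texts = ["I loved this movie.", "The plot was boring."]
sample_labels = [1, 0]
sample_dataset = CustomDataset(sample_texts, sample_labels, tokenizer, max_length=128)

item = sample_dataset[0]
print(item['input_ids'].shape)       # torch.Size([128]), padded/truncated to max_length
print(item['attention_mask'].shape)  # torch.Size([128])
print(item['labels'])                # tensor(1)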

We define training arguments such as the number of epochs, batch size, and evaluation strategy.
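If you also want a metric such as accuracy reported during evaluation, the Trainer accepts an optional compute_metrics callback. A minimal sketch (not part of the original example; pass it to the Trainer as compute_metrics=compute_metrics):

import numpy as np
from sklearn.metrics import accuracy_score

# Called by the Trainer at each evaluation; receives model predictions and true labels
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}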

Finally, we fine-tune the model using the Trainer class.
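After training finishes, you will typically want to evaluate on the validation set and save the fine-tuned model and tokenizer so they can be reloaded later. A minimal sketch (the output directory name here is just an example):

# Evaluate on the validation set
metrics = trainer.evaluate()
print(metrics)

# Save the fine-tuned model and tokenizer for later use
trainer.save_model('./fine_tuned_model')
tokenizer.save_pretrained('./fine_tuned_model')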

Conclusion

This tutorial covers the basics of using the transformers library for NLP tasks in Python. You can explore more advanced features of the library for tasks like sequence generation, question answering, and more.
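For instance, the pipeline API gives one-line access to many of these tasks. A quick sketch using gpt2 for text generation and the question-answering pipeline with its default model:

from transformers import pipeline

# Text generation with GPT-2
generator = pipeline("text-generation", model="gpt2")
print(generator("The transformers library makes it easy to", max_length=30))

# Extractive question answering (uses the pipeline's default model)
qa = pipeline("question-answering")
print(qa(question="Who developed the transformers library?",
         context="The transformers library was developed by Hugging Face."))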

Check out the official documentation for detailed information and examples: [Transformers Documentation](https://huggingface.co/transformers/).

--


Written by Sercan Gul | Data Scientist | DataScientistTX

Senior Data Scientist @ Pioneer | Ph.D Engineering & MS Statistics | UT Austin
