### Downloading Dataset

AIcrowd recently added a feature that lets you download the dataset for any challenge directly with the AIcrowd CLI.

So we first need to install AIcrowd's Python library, which lets us download the dataset by just providing an API key.

In [1]:
!pip install aicrowd-cli

API Key valid
Saved API Key successfully!


In [3]:
# Downloading the Dataset
!rm -rf data
!mkdir data


val.csv: 100% 714k/714k [00:00<00:00, 6.59MB/s]
test.csv: 100% 1.83M/1.83M [00:00<00:00, 11.4MB/s]
train.csv: 100% 7.00M/7.00M [00:00<00:00, 34.0MB/s]
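
Before moving on, it can help to confirm that the three files listed in the download log actually landed in `data/`. The following is a minimal sketch; the helper name `missing_files` is hypothetical and is not part of the AIcrowd CLI.

```python
from pathlib import Path

# File names taken from the download log above.
EXPECTED_FILES = ("train.csv", "val.csv", "test.csv")


def missing_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_files()
if missing:
    print("Missing files:", missing)
else:
    print("All dataset files are present.")
```

If anything is reported as missing, rerunning the download cell is usually the simplest fix.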


### Downloading & Importing Libraries

Here we are going to use [HuggingFace 🤗](https://huggingface.co/)! [HuggingFace](https://huggingface.co/) is a fast-growing startup focused on Natural Language Processing. It provides many open-source NLP libraries, including [**transformers**](https://huggingface.co/transformers/), for using state-of-the-art transformer models on NLP tasks, and [**datasets**](https://huggingface.co/docs/datasets/), which contains NLP datasets and many evaluation metrics.

In [4]:
# Installing
!pip install datasets transformers rich


In [5]:
# Importing Libraries
import pandas as pd
import numpy as np
import os

import torch
import datasets
from datasets import load_dataset
from transformers import EncoderDecoderModel, EncoderDecoderConfig, BertTokenizerFast, Seq2SeqTrainingArguments, Seq2SeqTrainer, BertConfig

# To make cell output more beautiful! 
from rich.console import Console
from rich.table import Table
from rich import pretty
pretty.install()

# function to display YouTube videos
from IPython.display import YouTubeVideo

In [6]:
from transformers import RobertaTokenizerFast, RobertaConfig

### Reading Dataset

Reading the files we need to train, validate, and submit our results!

In [7]:
train_dataset = pd.read_csv("data/train.csv")
validation_dataset = pd.read_csv("data/val.csv")
test_dataset = pd.read_csv("data/test.csv")

train_dataset

Unnamed: 0,text,label
0,"presented here Furthermore, naive improved. im...","Furthermore, the naive implementation presente..."
1,vector a in a form vector multidimensional spa...,Those coefficients form a vector in a multidim...
2,compatible of The model with recent is model s...,The model is compatible with a recent model of...
3,but relevance outlined. hemodynamics its based...,"The model is based on electrophysiology, but i..."
4,of transitions lever-like involve reorientatio...,Conformational transitions in macromolecular c...
...,...,...
39996,for pose a clutter. estimation autonomous mani...,Object pose estimation is a crucial prerequisi...
39997,objects added warehouses bin-picking present s...,Real-world bin-picking settings such as wareho...
39998,validation The proposed real-world on is metho...,The proposed method is evaluated on a syntheti...
39999,a between modes. statistics survival qualitati...,This breakdown is associated with a crossover ...


As we can see, the **text** column contains sentences with shuffled words, and the **label** column contains the corresponding sentences in their correct order.
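
One way to check this relationship between the two columns is to test whether both hold the same multiset of words. Below is a minimal sketch. It uses a made-up sentence pair rather than a real row, because the rows displayed above are truncated, and `same_words` is a hypothetical helper.

```python
from collections import Counter


def same_words(shuffled, original):
    """True if both sentences contain exactly the same whitespace-separated words."""
    return Counter(shuffled.split()) == Counter(original.split())


# Illustrative pair, not copied from the dataset.
original = "The model is based on electrophysiology."
shuffled = "based is electrophysiology. model on The"

print(same_words(shuffled, original))
```

The same helper could be applied row by row to `train_dataset` to see how often the property holds in the real data.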

In the cell below, we use Hugging Face's datasets library to load our dataset. As you will see, this helps a lot with preprocessing our texts and creating batches to feed into the transformer model!

In [8]:
# Loading the training, validation and testing dataset
dataset = load_dataset('csv', data_files={"train"     : ["data/train.csv"], 
                                          "validation": ["data/val.csv"], 
                                          "test"      : ["data/test.csv"]})
dataset

Using custom data configuration default-e7ff9ec3ded62018


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-e7ff9ec3ded62018/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-e7ff9ec3ded62018/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0. Subsequent calls will reuse this data.


# Preprocessing the dataset 🏭

There's a lot going on in the cells below, so let's go through it step by step!



### Understanding the Tokenizer

We have already talked about what a tokenizer is, but we will briefly cover it again and pick up a few new things along the way, so let's start:

In [12]:
# We load our tokenizer through the transformers library. Note that the checkpoint used here is xlm-roberta-base
# (a multilingual, case-sensitive model), not bert-base-uncased, so tokenization uses that checkpoint's pretrained vocabulary.

tokenizer = RobertaTokenizerFast.from_pretrained("xlm-roberta-base")
tokenizer

In [13]:
sample_text = train_dataset['label'][0]
sample_text

In [14]:
# Splitting the input sentence into subword tokens
tokens = tokenizer.tokenize(sample_text)

# Converting each token into its integer id from the pretrained tokenizer's vocabulary
token_ids = tokenizer.convert_tokens_to_ids(tokens)


print("Sample Text : ", sample_text)
print("Tokens      : ", tokens)
print("Token Ids   : ", token_ids)

Sample Text :  Furthermore, the naive implementation presented here can be improved.
Tokens      :  ['▁Fur', 'ther', 'more', ',', '▁the', '▁na', 'ive', '▁implementation', '▁presente', 'd', '▁here', '▁can', '▁be', '▁improve', 'd', '.']
Token Ids   :  [27766, 9319, 17678, 4, 70, 24, 5844, 208124, 8121, 71, 3688, 831, 186, 52295, 71, 5]


In [15]:
# To convert token ids back to the original string, we can simply use

output_text = tokenizer.decode(token_ids)
output_text
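
The `▁` prefix in the tokens printed earlier marks the start of a new word, which is the SentencePiece convention used by XLM-RoBERTa. As a rough illustration of what decoding does for plain text like this, the tokens can be glued back together by hand. This is only a sketch: `tokenizer.decode` also handles special tokens and other cases that this hypothetical helper ignores.

```python
# Tokens copied from the output of the tokenization cell.
tokens = ['▁Fur', 'ther', 'more', ',', '▁the', '▁na', 'ive', '▁implementation',
          '▁presente', 'd', '▁here', '▁can', '▁be', '▁improve', 'd', '.']


def join_sentencepiece(pieces):
    """Concatenate the pieces and turn the word-start marker into a space."""
    return "".join(pieces).replace("▁", " ").strip()


print(join_sentencepiece(tokens))
```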

## Creating the Dataset

In [16]:
# To keep the output dimensions consistent across samples, we set the maximum number of tokens for each sample.
MAX_TEXT_LENGTH = 40
MAX_LABEL_LENGTH = 40

def preprocess_function(sample):
  
  # Getting text and label
  text = sample["text"]
  label = sample["label"]

  # Tokenizing the text and label
  inputs = tokenizer(text, padding="max_length", truncation=True, max_length=MAX_TEXT_LENGTH)
  outputs = tokenizer(label, padding="max_length", truncation=True, max_length=MAX_LABEL_LENGTH)


  sample["input_ids"] = inputs.input_ids
  sample["attention_mask"] = inputs.attention_mask
  sample["decoder_input_ids"] = outputs.input_ids
  sample["decoder_attention_mask"] = outputs.attention_mask
  sample["labels"] = outputs.input_ids

  # The labels are used to calculate the loss during training. Because we padded every sequence to the same length,
  # we replace the padding token id (tokenizer.pad_token_id) with -100 so that padded positions are ignored in the loss.
  # Why -100 specifically? It is the default ignore_index of PyTorch's cross-entropy loss, so those positions do not contribute to the loss.
      
  sample["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in sample["labels"]]

  return sample
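
To make the padding and masking steps concrete, here is a minimal pure-Python sketch of what happens to one label sequence. It does not call the tokenizer. `PAD_ID = 1` is an assumption about `xlm-roberta-base` (the notebook reads the real value from `tokenizer.pad_token_id`), the helper names are hypothetical, and the truncation is simplified because the real tokenizer also adds and preserves special tokens.

```python
PAD_ID = 1           # assumption: pad token id of xlm-roberta-base
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut ids to max_length, then right-pad with pad_id up to max_length."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


def mask_padding(ids, pad_id=PAD_ID):
    """Replace padding ids with IGNORE_INDEX so the loss skips those positions."""
    return [IGNORE_INDEX if token == pad_id else token for token in ids]


# Token ids from the tokenization example (special tokens omitted for brevity).
token_ids = [27766, 9319, 17678, 4, 70, 24, 5844, 208124,
             8121, 71, 3688, 831, 186, 52295, 71, 5]

padded = pad_or_truncate(token_ids, max_length=20)
labels = mask_padding(padded)
print(padded)
print(labels)
```

The padded list is what goes into `decoder_input_ids`, and the masked list is what goes into `labels`.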

In [17]:
# Applying the preprocessing to every sample

BATCH_SIZE = 16
      
tokenized_datasets = dataset.map(preprocess_function, batch_size=BATCH_SIZE, batched=True)

In [18]:
# Convert the lists into torch tensors
tokenized_datasets.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

In [19]:
print(tokenized_datasets['train'][0].keys())
tokenized_datasets['train'][0], tokenized_datasets['train'][1]

dict_keys(['attention_mask', 'decoder_attention_mask', 'decoder_input_ids', 'input_ids', 'labels'])


In [20]:
# Using pretrained XLM-RoBERTa weights for both the encoder and the decoder
model = EncoderDecoderModel.from_encoder_decoder_pretrained("xlm-roberta-base", "xlm-roberta-base")

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForCausalLM were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['roberta.encoder.layer.11.crossattention.self.query.weight', 'roberta.encoder.layer.7.crossattention.self.value.weight', 'roberta.encoder.layer.6.cross

In [21]:
# The model architecture
model

# Training the model

## Setting up Training

In [22]:
# Setting up the parameters
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.encoder.vocab_size

In [23]:
# Setting up the batch size and number of epochs

N_EPOCHS = 1

args = Seq2SeqTrainingArguments(
    "Scambled Text",
    evaluation_strategy = "epoch",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=N_EPOCHS,
    fp16=True, # Mixed-precision training, which helps speed up training on supported GPUs
)
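
As a rough guide to how long one epoch is, the number of training steps follows from the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation. The row count comes from the training table shown earlier (indices 0 to 39999), and `steps_per_epoch` is a hypothetical helper.

```python
import math

TRAIN_ROWS = 40_000  # rows 0..39999 in the training table
BATCH_SIZE = 16
N_EPOCHS = 1


def steps_per_epoch(num_rows, batch_size):
    """Number of batches needed to see every row once (last batch may be partial)."""
    return math.ceil(num_rows / batch_size)


total_steps = steps_per_epoch(TRAIN_ROWS, BATCH_SIZE) * N_EPOCHS
print(total_steps)  # 2500
```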

In [24]:
# Setting up training

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)

## Training
🚀 Let's goooo!

In [25]:
# This will take around 20-25 minutes 

trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: ignored

# Submitting Results 📄

Phew! That was a lot to grasp. Anyway, let's make a submission real quick.

In [26]:
# Getting the predictions
def generate_predictions(batch):

    # Tokenizing the text
    inputs = tokenizer(batch["text"], padding="max_length", truncation=True, max_length=MAX_TEXT_LENGTH, return_tensors="pt")
    
    # Sending the tensors to GPU
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    # Generating the predicted tokens ids
    outputs = model.generate(input_ids, attention_mask=attention_mask)

    # Converting the token ids to sentence
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["predictions"] = output_str

    return batch

In [27]:
# Getting all results
results = dataset['test'].map(generate_predictions, batched=True, batch_size=16)

In [28]:
test_dataset

Unnamed: 0,id,text,label
0,0,safely objects move. that system images detect...,"system Using approach, move. safely skip of th..."
1,1,We detectors popular influence of confidences ...,confidences popular different in detectors We ...
2,2,compact coding present supervised We a approac...,"We a supervised approach, compact present codi..."
3,3,study high-throughput vital of quantitative be...,is for and individuals study of of collective ...
4,4,on data sets. We evaluate method many challeng...,sets. challenging the data We method on many e...
...,...,...,...
9995,9995,"particular i.e. problem, of to due However, na...","the i.e. of problem, However, particular to th..."
9996,9996,Simulation methods. proposed outperforming met...,state-of-the-art the that of Simulation demons...
9997,9997,in This view introduces a scenarios. paper tec...,used label introduces in placement scenarios. ...
9998,9998,valve. water from water are and pipeline sourc...,source noise interference of The are valve. pi...


In [29]:
test_dataset['label'] = results['predictions']
test_dataset

Unnamed: 0,id,text,label
0,0,safely objects move. that system images detect...,"U this this, detection det det detection detec..."
1,1,We detectors popular influence of confidences ...,We have three three in of three three three in...
2,2,compact coding present supervised We a approac...,We present a compact compact compact compact c...
3,3,study high-throughput vital of quantitative be...,Tranjectject of of of of oftivetive ofive ofti...
4,4,on data sets. We evaluate method many challeng...,We evaluat evaluat proposed propose challengin...
...,...,...,...
9995,9995,"particular i.e. problem, of to due However, na...","However, particular the particular the particu..."
9996,9996,Simulation methods. proposed outperforming met...,Simulatione demonstrat demonstrat demonstratd ...
9997,9997,in This view introduces a scenarios. paper tec...,This paper paper as amentmentmentmentmentmentm...
9998,9998,valve. water from water are and pipeline sourc...,"The the no the input, input theise input,, inp..."


**Note: Please make sure there is a file named `submission.csv` in the `assets` folder before submitting.**

In [30]:
!mkdir assets
test_dataset.to_csv(os.path.join("assets", "submission.csv"), index=False)
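
A quick check of the written file can catch obvious problems before uploading. This is a minimal sketch using only the standard library. The helper `check_submission` is hypothetical, and requiring only a `label` column is an assumption based on where this notebook stores its predictions, since the grader's exact requirements are not stated here.

```python
import csv
import os


def check_submission(path, required_columns=("label",)):
    """Return a list of problems found in the submission file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        problems = [f"missing column: {c}" for c in required_columns if c not in header]
        if "label" in header:
            # Count rows whose prediction is empty or whitespace only.
            empty = sum(1 for row in reader if not (row["label"] or "").strip())
            if empty:
                problems.append(f"{empty} empty label(s)")
    return problems


print(check_submission(os.path.join("assets", "submission.csv")))
```

An empty list means the file exists, has the expected column, and contains no blank predictions.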

## Uploading the Results 
**Note : Please save the notebook before submitting it (Ctrl + S)**

Mounting Google Drive 💾
Your Google Drive will be mounted to access the colab notebook


Congratulations 🎉 you did it! There is still a lot of room for improvement: you can try changing hyperparameters such as the number of epochs or the learning rate, or try a different model architecture.

And btw -

> Don't be shy about asking questions about any errors you are getting, or doubts about any part of this notebook, in the [discussion forum](https://www.aicrowd.com/challenges/ai-blitz-9/problems/de-shuffling-text/discussion) or on the [AIcrowd Discord server](https://discord.gg/T6uZSWBMSZ). AIcrew will be happy to help you :)

Also, want to give us your valuable feedback for the next blitz, or want to work with us on creating blitz challenges? Let us know!