![](https://images.aicrowd.com/raw_images/challenges/banner_file/1005/f32303cdf2b8a796b74c.jpg)

<h2><center> Getting Started with Lingua Franca Translation</center></h2>

In this puzzle, we have to translate from the crowd-talk language into English. There are multiple ways to build the translator:
- Using Dictionary and Mapping
- Using LSTM
- Using Transformers

In this starter notebook, we'll go with Transformers: we fine-tune a pretrained T5 model (`t5-base`) through the `simpletransformers` library on pairs of crowd-talk and English sentences.
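For comparison, the dictionary-and-mapping option from the list above can be sketched in a few lines. This is only a minimal illustration, assuming that a word-level mapping can be estimated from co-occurrence counts; the helper names `build_word_dictionary` and `translate`, the `<unk>` placeholder, and the toy corpus are all hypothetical and not taken from the challenge data.

```python
from collections import Counter, defaultdict


def build_word_dictionary(pairs):
    """Map each source word to the target word it co-occurs with most often."""
    cooccurrence = defaultdict(Counter)
    for source_sentence, target_sentence in pairs:
        target_words = target_sentence.lower().split()
        for source_word in source_sentence.split():
            cooccurrence[source_word].update(target_words)
    # Keep only the most frequent partner for each source word.
    return {word: counts.most_common(1)[0][0] for word, counts in cooccurrence.items()}


def translate(sentence, dictionary, unknown="<unk>"):
    """Translate word by word; words never seen in training become `unknown`."""
    return " ".join(dictionary.get(word, unknown) for word in sentence.split())


# Hypothetical toy corpus, not taken from the challenge data.
toy_pairs = [
    ("aa bb", "red cat"),
    ("aa cc", "red dog"),
    ("dd bb", "blue cat"),
]
toy_dictionary = build_word_dictionary(toy_pairs)
print(translate("aa bb zz", toy_dictionary))  # red cat <unk>
```

A mapping like this ignores word order and multi-word units, which is one reason a sequence-to-sequence model may be a better fit for this task.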

# Download the files 💾
## Download AIcrowd CLI

We will first install aicrowd-cli, which lets you download the dataset and later make a submission directly from the notebook.


In [1]:
%%capture
!pip install aicrowd-cli
%load_ext aicrowd.magic


## Login to AIcrowd ㊗


In [2]:
%aicrowd login

Please login here: https://api.aicrowd.com/auth/wyqieRJxMOaibLRRk7p_31k0czoz8Z2OpIynkJncWds
API Key valid
Saved API Key successfully!



## Download Dataset

We will create a folder named data and download the files there.


In [21]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c lingua-franca-translation -o data

sample_submission.csv:   0%|          | 0.00/27.9k [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/437k [00:00<?, ?B/s]

train.csv:   0%|          | 0.00/1.92M [00:00<?, ?B/s]

# Importing Necessary Libraries

In [22]:
import os
import pandas as pd

import gensim
from sklearn.metrics.pairwise import cosine_similarity

# Diving into the Dataset

In [23]:
train_df = pd.read_csv("data/train.csv")

In [24]:
train_df.head()

Unnamed: 0,id,crowdtalk,english
0,31989,wraov driourth wreury hyuirf schneiald chix lo...,upon this ladder one of them mounted
1,29884,treuns schleangly kriaors draotz pfiews schlio...,and solicited at the court of Augustus to be p...
2,26126,toirts choolt chiugy knusm squiend sriohl gheold,but how am I sunk!
3,44183,schlioncy yoik yahoos dynuewn maery schlioncy ...,the Yahoos draw home the sheaves in carriages
4,19108,treuns schleangly tsiens mcgaantz schmeecks tr...,and placed his hated hands before my eyes
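Before training, it can help to compare the two sides of the corpus. The sketch below counts sentences, distinct words, and mean sentence length using plain whitespace tokenization; the helper name `corpus_stats` is hypothetical, and the single example row is copied from the `train_df.head()` output above.

```python
from collections import Counter


def corpus_stats(sentences):
    """Return (number of sentences, vocabulary size, mean words per sentence)."""
    tokenized = [sentence.split() for sentence in sentences]
    vocabulary = Counter(word for words in tokenized for word in words)
    mean_length = sum(len(words) for words in tokenized) / len(tokenized)
    return len(tokenized), len(vocabulary), mean_length


# One complete row copied from the train_df.head() output above.
crowdtalk = ["toirts choolt chiugy knusm squiend sriohl gheold"]
english = ["but how am I sunk!"]

print(corpus_stats(crowdtalk))  # (1, 7, 7.0)
print(corpus_stats(english))    # (1, 5, 5.0)
```

On the full data, the same function could be called with `train_df["crowdtalk"].dropna()` and `train_df["english"].dropna()`, assuming both columns hold plain strings.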


# Training

In [25]:
!pip install simpletransformers



In [26]:
train_df.columns = ["id", "input_text", "target_text"]

train_df['prefix'] = "translate"

eval_df = train_df.tail(2955)[["prefix", "input_text", "target_text"]]
train_df = train_df.head(9000)[["prefix", "input_text", "target_text"]]

#eval_df = train_df.tail(1000)[["prefix", "input_text", "target_text"]]
#train_df = train_df.head(3000)[["prefix", "input_text", "target_text"]]

train_df.shape, eval_df.shape

((9000, 3), (2955, 3))
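The split above is positional: `head(9000)` and `tail(2955)` are disjoint only if the file has at least 11,955 rows, and the rows are not shuffled. A minimal sketch of a shuffled split that stays disjoint for any number of rows follows; `split_indices`, `eval_fraction`, and `seed` are hypothetical names chosen for illustration.

```python
import random


def split_indices(n_rows, eval_fraction=0.25, seed=42):
    """Shuffle row positions and split them into disjoint train/eval lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_eval = int(n_rows * eval_fraction)
    return indices[n_eval:], indices[:n_eval]


train_idx, eval_idx = split_indices(n_rows=100)
print(len(train_idx), len(eval_idx))        # 75 25
print(set(train_idx).isdisjoint(eval_idx))  # True
```

With pandas, the two lists could be passed to `.iloc[...]` on the full frame as loaded from `train.csv`, before it is overwritten by the sliced `train_df`.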

In [27]:
train_df

Unnamed: 0,prefix,input_text,target_text
0,translate,wraov driourth wreury hyuirf schneiald chix lo...,upon this ladder one of them mounted
1,translate,treuns schleangly kriaors draotz pfiews schlio...,and solicited at the court of Augustus to be p...
2,translate,toirts choolt chiugy knusm squiend sriohl gheold,but how am I sunk!
3,translate,schlioncy yoik yahoos dynuewn maery schlioncy ...,the Yahoos draw home the sheaves in carriages
4,translate,treuns schleangly tsiens mcgaantz schmeecks tr...,and placed his hated hands before my eyes
...,...,...,...
8995,translate,jauesk hyuefy liourd neaf treuns schleangly ja...,nor so slow and perplexed in their conceptions...
8996,translate,mcmoorth dwiountz choals tweuns kaorn schwauh ...,They would often strip me naked from top to toe
8997,translate,schroegs tweuws squiend sriohl threel zoiass g...,“I had likewise observed another thing
8998,translate,treuns schleangly scaiany thrern schlioncy yoi...,and walk about the streets and fields without ...


In [28]:
import logging

import pandas as pd
from simpletransformers.t5 import T5Model, T5Args

logging.basicConfig(level=logging.ERROR)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

# train_data = [
#     ["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
#     ["generate question", "Anakin was Luke's father" , "Who was Luke's father?"],
# ]
# train_df = pd.DataFrame(train_data)
# train_df.columns = ["prefix", "input_text", "target_text"]

# eval_data = [
#     ["generate question", "In 2020, the Star Wars franchise's total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time.", "What is the total value of the Star Wars franchise?"],
#     ["generate question", "Leia was Luke's sister" , "Who was Luke's sister?"],
# ]
# eval_df = pd.DataFrame(eval_data)
# eval_df.columns = ["prefix", "input_text", "target_text"]

# Configure the model
model_args = T5Args()
model_args.num_train_epochs = 20
model_args.no_save = True
model_args.evaluate_generated_text = True
model_args.evaluate_during_training = False
model_args.evaluate_during_training_verbose = False
model_args.overwrite_output_dir = True
model_args.save_model_every_epoch = False
model_args.save_steps = -1
model_args.silent = True

model = T5Model("t5", "t5-base", args=model_args, use_cuda=True)

# Train the model
model.train_model(train_df, eval_data=eval_df)

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and the tokenizer under the `as_target_tokenizer` context manager to prepare
your targets.

Here is a short example:

model_inputs = tokenizer(src_texts, ...)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.



(22500, nan)

In [29]:
# Evaluate the model
result = model.eval_model(eval_df)


```
> toirts choolt cycleed czaurd squiend sriohl knusm xath
= But here also I am checked.
< But here I am miserable of <EOS>

> criaody sheatch schroegs tweuws hypuids flisp smauels xuann schlioncy yoik speiafy spleorry mcmiaors wheians veurg tiontly dieung priect shieurr truibly
= Two years had now nearly elapsed since the night on which he first received life
< The first he had been since since on the which which he first first first <EOS>

> squiend sriohl schlausk shoiasp wrauebs rhiorf freauth
= I wept like a child.
< I walked like a league <EOS>

> squiend sriohl shaills squiend sriohl mcnaiaks traff chreiast schoutch ceuntly
= I feel that I shall soon die
< I believe that I shall soon myself <EOS>

> troeght chriaoty hypaiaks hoeff schlioncy yoik choaw treuns schleangly zaiacy mcfipt
= who exhibited towards him the simplest and tenderest affection.
< who towards him him the and and queen <EOS>

> draotz pfiews piiy hydreong squiend sriohl schlausk cyclaask treuns schleangly baurr mcnaiaks traff qeald choals spraills cycluierg triild hyiaf mcdoocts squiend sriohl mcnaiaks traff squawn cheiact schloors rhiuny rheaunds treuns schleangly sproms
= At these moments I wept bitterly and wished that peace would revisit my mind only that I might afford them consolation and happiness.
< at these I I I I only and that I only my my my that that I would hear them <EOS>

> mcnaiaks traff seaut schlioncy yoik dits wheians veurg yoight mcnaiaks traff schriild
= that the same letters which compose that sentence
< that the same which which that that <EOS>

> squiend sriohl mcuafy straiav schloors rhiuny spluiey gheuck mcnatz mcgeaurg skiins claght luag rhairk schlioncy yoik peat cycluierg triild schneiald chix kloahl schieurry
= I looked upon them as superior beings who would be the arbiters of my future destiny.
< I looked upon them as the who who would be the of of my my <EOS>

> troeght spruengs mcgaantz schmeecks schroerty mccruems zally mcgaantz schmeecks pludly phost floff caith
= who drove his friend from his door with contumely?
< who fell his friend from his with with <EOS>

> treuns schleangly mckaiongs schweaung riaov friarn staatts wheians veurg schlioncy yoik mcniotts sleiarm groos tsuar schrioun schwauh szaiabs
= and thereby prevented any ill treatment which the others might have given me.
< and neither any thing which which the the have have given me me <EOS>
```
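To put a rough number on samples like these, one option is a clipped unigram precision: the share of predicted words that also appear in the reference. This is a sketch for quick inspection only and is not assumed to be the challenge's official metric; the function name `unigram_precision` is hypothetical, and the `<EOS>` marker is dropped before scoring.

```python
from collections import Counter


def unigram_precision(reference, prediction, eos="<EOS>"):
    """Fraction of predicted words that also occur in the reference (clipped counts)."""
    predicted = [word for word in prediction.split() if word != eos]
    if not predicted:
        return 0.0
    reference_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((Counter(predicted) & reference_counts).values())
    return matched / len(predicted)


# Reference ("=") and model output ("<") lines copied from the samples above.
reference = "I feel that I shall soon die"
prediction = "I believe that I shall soon myself <EOS>"
print(round(unigram_precision(reference, prediction), 3))  # 0.714
```

The score ignores word order, so it should be read only as a coarse sanity check next to the sentences themselves.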

# Prediction Phase ✈

In [30]:
test_df = pd.read_csv("data/test.csv")

In [31]:
test_df['crowdtalk2'] = test_df.crowdtalk.apply(lambda x: "translate: "+x)
list(test_df['crowdtalk2'].values)

['translate: treuns schleangly throuys praests qeipp cycluierg triild schneiald chix siess',
 'translate: feosch treuns schleangly gliath spluiey gheuck sooc kniousts squiend sriohl',
 'translate: scraocs knaedly squiend sriohl clield whaioght spleorry mcmiaors cycluierg triild twoiasts',
 'translate: sqaups schlioncy yoik gnoirk cziourk schnaunk tiontly dieung schroegs tweuws schrioun schwauh szaiabs',
 'translate: schlioncy yoik psycheiancy mcountz pously mcnaiaks traff schluorty',
 'translate: mcik schlioncy yoik gnoirk cziourk hydriour sploct schneiald chix gleums throds',
 'translate: squiend sriohl zuff squiend sriohl squawn schmach gnoirk cziourk snuiet mcgaantz schmeecks hypaarry mcdiaongs gneufy gniug',
 'translate: tiontly dieung treiahl typeauty doigs cluierg sriaocs scuass choash mcgeesh schmoub schlioncy yoik yahoos schneiald chix mcnaiaks traff mccriug treuns schleangly gleums cycluiedly flueh wrusk',
 'translate: mcnaiaks traff snoird mcdaabs slaiy tseemp pfeaurk schwauh

In [32]:
to_predict = [
    'translate: treuns schleangly throuys praests qeipp cycluierg triild schneiald chix siess',
    'translate: squiend sriohl shaills squiend sriohl mcnaiaks traff chreiast schoutch ceuntly'
]
preds = model.predict(to_predict)
preds


['and forthly strange things of my beauty', 'I feel that I shall soon die']

In [33]:
preds = model.predict(list(test_df['crowdtalk2'].values))
preds


['and forthly strange things of my beauty',
 'scared and desponding as I was',
 'When I found myself on my feet',
 'according to the license he had given me',
 'the very worst effects that avarice',
 'added to the natural resources of those animals',
 'I hoped I might live to do his majesty some signal service',
 'he could find little or no resemblance between the Yahoos of that country',
 'that three hundred tailors should make me a matter of clothes',
 'and of those who desired the people’s express.',
 'but perpetually disturbed with dreams of the place I had left.',
 'soaring at him ever since.',
 'He took the country from her head and interpreted it to the cottage himself.',
 'The clothes and food of the children are plain and simple',
 'It was about five in the morning when I entered my father’s house',
 'The night was nearly dark',
 'for I confess I owede the security of my eyes',
 'without letting alone my intended labours would have made me man any of the ir',
 'These nurses we

In [34]:
test_df['prediction'] = preds

In [35]:
test_df.head()

Unnamed: 0,id,crowdtalk,crowdtalk2,prediction
0,27226,treuns schleangly throuys praests qeipp cyclui...,translate: treuns schleangly throuys praests q...,and forthly strange things of my beauty
1,31034,feosch treuns schleangly gliath spluiey gheuck...,translate: feosch treuns schleangly gliath spl...,scared and desponding as I was
2,35270,scraocs knaedly squiend sriohl clield whaioght...,translate: scraocs knaedly squiend sriohl clie...,When I found myself on my feet
3,23380,sqaups schlioncy yoik gnoirk cziourk schnaunk ...,translate: sqaups schlioncy yoik gnoirk cziour...,according to the license he had given me
4,92117,schlioncy yoik psycheiancy mcountz pously mcna...,translate: schlioncy yoik psycheiancy mcountz ...,the very worst effects that avarice


Save the predictions in the assets directory under the name submission.csv.

In [None]:
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"), index=False)


# Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S).


In [None]:
%aicrowd notebook submit -c lingua-franca-translation -a assets --no-verify