![](https://images.aicrowd.com/raw_images/challenges/banner_file/1005/f32303cdf2b8a796b74c.jpg)

<h2><center> Getting Started with Lingua Franca Translation</center></h2>

In this puzzle, we have to translate from the crowd-talk language into English. There are multiple ways to build the translator:
- Using Dictionary and Mapping
- Using LSTM
- Using Transformers

In this starter notebook, we'll go with dictionary and mapping: we'll build a vocabulary for both English and the crowd-talk language, then map each crowd-talk token to its most likely English word.
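
As a rough sketch of the dictionary-and-mapping idea, word-for-word translation is one lookup per token. The entries in `toy_dict` are guesses read off the sample rows of the training data, not output of the dictionary built in this notebook, and `translate` is a hypothetical helper.

```python
# Toy crowd-talk -> English dictionary. These pairs are illustrative guesses,
# not entries from the dictionary built in this notebook.
toy_dict = {
    "toirts_choolt": "but",
    "treuns_schleangly": "and",
    "schlioncy_yoik": "the",
}


def translate(tokens, mapping):
    """Translate token by token, silently skipping unknown tokens."""
    return " ".join(mapping[t] for t in tokens if t in mapping)


result = translate(["toirts_choolt", "schlioncy_yoik", "unknownword"], toy_dict)
print(result)  # but the
```

Skipping unknown tokens keeps the output readable, at the cost of dropping words the dictionary has never seen.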

# Download the files 💾
## Download AIcrowd CLI

We will first install aicrowd-cli, which lets you download the dataset and later submit directly from the notebook.


In [1]:
%%capture
!pip install aicrowd-cli
%load_ext aicrowd.magic


## Login to AIcrowd ㊗


In [2]:
%aicrowd login

Please login here: https://api.aicrowd.com/auth/sZ7gLTekjIZZOfaLq4ddVltpashghpepc9YlzfZqbIU
API Key valid
Saved API Key successfully!



## Download Dataset

We will create a folder named data and download the files there.


In [3]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c lingua-franca-translation -o data

sample_submission.csv:   0%|          | 0.00/27.9k [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/437k [00:00<?, ?B/s]

train.csv:   0%|          | 0.00/1.92M [00:00<?, ?B/s]

# Importing Necessary Libraries

In [4]:
import os
import pandas as pd
import gensim
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Diving into the dataset

In [6]:
train_df = pd.read_csv("data/train.csv")

In [7]:
test_df = pd.read_csv("data/test.csv")

In [8]:
from gensim.models import Phrases
from gensim.models import Word2Vec
# Train a bigram detector.

my_sents=[s.split(" ") for s in train_df.crowdtalk]

bigram_transformer = Phrases(my_sents,min_count=3)

# Apply the trained MWE detector to a corpus, using the result to train a Word2vec model.
model = Word2Vec(bigram_transformer[my_sents], min_count=1)
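
For intuition, the bigram detection in the cell above can be approximated in plain Python. This is a simplified stand-in, not gensim's implementation: `find_bigrams` and `merge_bigrams` are hypothetical helpers, and the score formula reflects my reading of gensim's default scorer, so treat it as an assumption. The demo lowers `threshold` because the toy corpus is tiny.

```python
from collections import Counter


def find_bigrams(sentences, min_count=3, threshold=10.0):
    """Return adjacent word pairs that occur often enough to be treated as one token."""
    unigrams = Counter(w for s in sentences for w in s)
    pairs = Counter((a, b) for s in sentences for a, b in zip(s, s[1:]))
    vocab_size = len(unigrams)
    kept = set()
    for (a, b), n in pairs.items():
        if n < min_count:
            continue
        # Assumed scoring rule: frequent pairs made of otherwise rare words score high.
        score = (n - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            kept.add((a, b))
    return kept


def merge_bigrams(sentence, bigrams):
    """Join detected pairs with an underscore, scanning left to right."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in bigrams:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out


toy_sents = [
    ["toirts", "choolt", "chiugy"],
    ["toirts", "choolt", "knusm"],
    ["toirts", "choolt", "gheold"],
    ["toirts", "choolt", "chiugy"],
]
bigrams = find_bigrams(toy_sents, min_count=3, threshold=0.1)
merged = merge_bigrams(["toirts", "choolt", "chiugy"], bigrams)
print(merged)  # ['toirts_choolt', 'chiugy']
```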



In [9]:
my_sents=[s.split(" ") for s in train_df.crowdtalk]


In [10]:
bi_words=set([s.decode('utf-8') for s in list(bigram_transformer.vocab.keys())])

In [85]:
len(bi_words)

62686

In [96]:
crowd_words=[]
lengths=[]
for s in train_df.crowdtalk.values:
  words= s.split(" ")
  lengths.append(len(words))
  for w in words:
    crowd_words.append(w)

In [97]:
my_test=[s.split(" ") for s in test_df.crowdtalk]
test_set=[]
for s in bigram_transformer[my_test]:
  for w in s:
    test_set.append(w)



In [12]:
not_present=[]
present=[]
for word in test_set:
  if word not in bi_words:
    not_present.append(word)
  else:
    present.append(word)

In [19]:
len(not_present)

1242

In [11]:
words=[[word.lower() for word in nltk.word_tokenize(s) if word.isalnum()] for s in train_df.english.values]
english_words=[]
for s in words:
  for w in s:
    english_words.append(w)
english_words=set(english_words)

In [83]:
len(english_words)

8970

In [12]:
crowd_indices={v:k for k,v in enumerate(bi_words)}
english_indices={v:k for k,v in enumerate(english_words)}

In [13]:
reverse_english_idx={v:k for k,v in english_indices.items()}

In [29]:
import numpy as np
crowd_to_english_mat=np.zeros((len(english_words),len(bi_words)))

In [None]:
words

In [30]:
my_sents=[s.split(" ") for s in train_df.crowdtalk]
for crowd_sentence,english_sentence in zip(my_sents,words):
  bi_sent=bigram_transformer[crowd_sentence]
  for wi,word in enumerate(bi_sent):
    try:
      word_idx=crowd_indices[word]
      ewords_idx=[english_indices[eword]for eword in english_sentence]
      for ei,eidx in enumerate(ewords_idx):
        if abs(wi-ei)<2:
          crowd_to_english_mat[eidx,word_idx]+=1
      # crowd_to_english_mat[ewords_idx,word_idx]+=1
    except KeyError:
      pass
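
The counting loop above credits an English word to a crowd-talk token only when their positions differ by less than 2. A minimal pure-Python sketch of the same windowed counting follows, with a `Counter` in place of the NumPy matrix. The function name `count_alignments` and the toy sentences are made up for illustration.

```python
from collections import Counter


def count_alignments(pairs, window=2):
    """Count (crowd_token, english_word) pairs whose positions differ by less than `window`."""
    counts = Counter()
    for crowd_tokens, english_tokens in pairs:
        for ci, crowd_word in enumerate(crowd_tokens):
            for ei, english_word in enumerate(english_tokens):
                if abs(ci - ei) < window:
                    counts[(crowd_word, english_word)] += 1
    return counts


toy_pairs = [
    (["aa", "bb", "cc"], ["one", "two", "three"]),
    (["aa", "cc"], ["one", "three"]),
]
counts = count_alignments(toy_pairs)
print(counts[("aa", "one")], counts[("cc", "three")])  # 2 2
```

The window only helps when both languages keep roughly the same word order. A pair that is consistently far apart never gets counted.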



In [102]:
crowd_to_english_mat.shape

(8970, 62686)

In [35]:
from sklearn.preprocessing import StandardScaler
scl=StandardScaler(copy=False)

In [16]:
for i in range(crowd_to_english_mat.shape[0]):
  crowd_to_english_mat[i,crowd_to_english_mat[i,:]!=crowd_to_english_mat[i,:].max()]=0

In [22]:
for w in stopwords.words('english'):
  if w in english_words:
    i=english_indices[w]
    crowd_to_english_mat[i,crowd_to_english_mat[i,:]!=np.max(crowd_to_english_mat[i,:])]=0

In [31]:
translation_dict={}
crowd_to_english_mat/=(np.mean(crowd_to_english_mat, axis=1).reshape(-1,1)+1)
for word in bi_words:
  word_idx=crowd_indices[word]
  max_trans_idx=np.argmax(crowd_to_english_mat[:,word_idx])
  translation=reverse_english_idx[max_trans_idx]
  translation_dict[word]=translation
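
To see why each English row is divided by its mean plus one before the per-column argmax, here is a small sketch on plain lists. The matrix values, vocabularies, and the helper name `best_translations` are invented for illustration.

```python
def best_translations(matrix, english_vocab, crowd_vocab):
    """matrix[e][c] is the co-occurrence count of English word e with crowd token c."""
    # Divide each English row by (row mean + 1), mirroring the cell above.
    scaled = []
    for row in matrix:
        denom = sum(row) / len(row) + 1
        scaled.append([value / denom for value in row])
    result = {}
    for c, crowd_word in enumerate(crowd_vocab):
        column = [scaled[e][c] for e in range(len(english_vocab))]
        best = max(range(len(column)), key=column.__getitem__)
        result[crowd_word] = english_vocab[best]
    return result


english_vocab = ["the", "ladder"]
crowd_vocab = ["c0", "c1", "c2"]
matrix = [
    [9, 8, 7],  # "the" co-occurs with everything
    [0, 0, 6],  # "ladder" co-occurs with c2 only
]
mapping = best_translations(matrix, english_vocab, crowd_vocab)
print(mapping)  # {'c0': 'the', 'c1': 'the', 'c2': 'ladder'}
```

On the raw counts, `c2` would map to "the" (7 against 6). After scaling, the frequent word is damped and "ladder" wins.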

In [143]:
len(stopwords.words('english'))

179

In [32]:
import  nltk.translate.bleu_score as bleu
bleues=[]
sentences=[]
for i in range(len(train_df.crowdtalk.values)):
  reference_trans=[train_df.english[i].lower().split(" ")]
  candidate=[translation_dict[w] for w in bigram_transformer[my_sents[i]]]
  sentences.append(" ".join(candidate))
  score=bleu.sentence_bleu(reference_trans,candidate)
  bleues.append(score)
  if score<.3:
    print(" ".join(reference_trans[0]),"||||"," ".join(candidate))

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


at the centre of the island there is a chasm about fifty yards in diameter |||| at the the centre the the of island is is a fifty fifty yards in diameter
“that for the sake of my patron the king of luggnagg |||| that for the the sake my of patron the the king of luggnagg
“that if good fortune ever restored me to my native country |||| that if fortune ever restored me my to native
in order to stifle or divert the clamour of the subjects against their evil administration. |||| in order to theodorus or divert the the clamour the the of subjects against their evil administration
pronouncing with fervour the names of the most distinguished discoverers. |||| pronouncing with fervour the the names the the of most distinguished trample
and the hue both of that and the dug |||| the the and hue both of that the the and casually
“the volume of plutarch’s lives which i possessed contained the histories of the first founders of the ancient republics. |||| the the volume of plutarch lives which i po

In [183]:
pd.DataFrame({"english":train_df.english,"translated":sentences}).to_csv('inspection.csv')

In [33]:
np.mean(bleues)

0.7034426681786159

In [34]:
reference_trans=["i went to the park".lower().split(" ")]
candidate2='i went the to park'.split()
bleu.sentence_bleu(reference_trans,candidate2)

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


0.7071067811865476
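
The value above can be reproduced by hand, which also shows what the warning means. The sketch below computes clipped n-gram precisions in plain Python. The final step rests on an assumption: orders with zero overlap are dropped and the remaining orders keep a weight of 0.25 each. That matches the printed number but may not hold for other NLTK versions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision for a single reference, as a (matches, total) pair."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matches, max(sum(cand.values()), 1)


reference = "i went to the park".split()
candidate = "i went the to park".split()
precisions = [modified_precision(reference, candidate, n) for n in range(1, 5)]
print(precisions)  # [(5, 5), (1, 4), (0, 3), (0, 2)]

# Assumption: zero-overlap orders are skipped; lengths are equal, so no brevity penalty.
score = math.exp(sum(0.25 * math.log(m / t) for m, t in precisions if m > 0))
print(round(score, 4))  # 0.7071
```

A strict geometric mean over all four orders would be 0 here, so scores computed this way may be optimistic. Passing a `SmoothingFunction` method, as the warning suggests, is one way to get values that are easier to compare.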

In [None]:
train_df.crowdtalk[2]

'toirts choolt chiugy knusm squiend sriohl gheold'

In [None]:
[translation_dict[w] for w in bigram_transformer[my_sents[1]]],[train_df.english[1].lower().split(" ")]


In [None]:
[translation_dict[w] for w in bigram_transformer[my_sents[0]]]



['upon', 'this', 'ladder', 'of', 'one', 'them', 'mounted']

In [None]:
for p in bigram_transformer.export_phrases(train_df.crowdtalk.values):
  print(p)

In [None]:
bigram_transformer[my_sents[2]]



['toirts_choolt', 'chiugy', 'knusm', 'squiend_sriohl', 'gheold']

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(analyzer='word',ngram_range=(1,2),min_df=3,max_features=100)
cv_fit=cv.fit_transform(train_df.crowdtalk)

In [None]:
print()

[ 516 4372  774 2020 2020 2068 2068 3658  807  807  874  874  842  593
  593  537  537  600 1054 2020  419  419  921 3658 3658  465  465  855
  405  405 1512  690  405  670  670  555  555  874  882  882  595  600
  600 1477 1477  419  593 1054 1054  571  571  670 2486 4007 6554 6554
  882  690  690 4372 4372  928  928 1233 1233  855  855 1512 1512  516
  516  595  595  921  921 3534 3534 3534  807  571 1233  842  842  774
  774 1477 4007 4007 2068  465  928  847  847  847 2486 2486  537 6554
  695  555]


In [None]:
print()
for k,v in zip(cv.get_feature_names(),cv_fit.toarray().sum(axis=0)):
  print(k,v)

In [None]:
model.train(bigram_transformer[my_sents], total_examples=len(train_df.crowdtalk.values), epochs=3)



(243383, 354384)

In [None]:
model.vocabulary.raw_vocab

defaultdict(int, {})

In [None]:
train_df

Unnamed: 0,id,crowdtalk,english
0,31989,wraov driourth wreury hyuirf schneiald chix lo...,upon this ladder one of them mounted
1,29884,treuns schleangly kriaors draotz pfiews schlio...,and solicited at the court of Augustus to be p...
2,26126,toirts choolt chiugy knusm squiend sriohl gheold,but how am I sunk!
3,44183,schlioncy yoik yahoos dynuewn maery schlioncy ...,the Yahoos draw home the sheaves in carriages
4,19108,treuns schleangly tsiens mcgaantz schmeecks tr...,and placed his hated hands before my eyes
...,...,...,...
11950,50106,hydriaond cieurry mcdaabs swiings schlioncy yo...,about five hundred leagues to the east
11951,14786,treuns schleangly criaody treuns schleangly wr...,) and two and a half in breadth
11952,16903,toirts choolt cycluierg triild schuony hypuids...,“But my toils now drew near a close
11953,68451,toantz spluiey gheuck schoutch spluiey gheuck ...,going as soon as I was dressed to pay my atten...


In [87]:
crowd_words=[]
lengths=[]
for s in train_df.crowdtalk.values:
  words= s.split(" ")
  lengths.append(len(words))
  for w in words:
    crowd_words.append(w)

In [88]:
len(set(crowd_words))

9245

In [None]:
import jellyfish
from tqdm.notebook import tqdm
mins=[]
crowd_vocab=list(set(crowd_words))
for word in tqdm(not_present):
  matches=[jellyfish.levenshtein_distance(word, b_token) for b_token in crowd_vocab]
  mins.append(np.min(matches))  # smallest edit distance to any known crowd-talk word
  if np.min(matches)==1:
    print(word,crowd_vocab[np.argmin(matches)])
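
For reference, the edit distance used above for out-of-vocabulary words can be written without a dependency. This is a textbook dynamic-programming sketch, not jellyfish's implementation, and `closest` is a hypothetical helper.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def closest(word, vocab):
    """Return (nearest vocabulary word, distance); ties go to the first match."""
    return min(((v, levenshtein(word, v)) for v in vocab), key=lambda pair: pair[1])


vocab = ["chiugy", "knusm", "gheold"]
match = closest("chiugi", vocab)
print(match)  # ('chiugy', 1)
```

A distance of 1 is a reasonable cut-off for treating an unseen test word as a variant of a known one. Larger distances quickly produce unrelated matches on short words.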

In [None]:
mins

In [None]:
import numpy as np
np.mean(lengths)

13.66624843161857

In [None]:
english_words=[]
lengths=[]
for s in train_df.english.values:
  words= s.split(" ")
  lengths.append(len(words))
  for w in words:
    english_words.append(w.lower())

In [None]:
len(english_words),len(crowd_words),len(set(english_words))

(112431, 163380, 11492)

In [None]:
english = train_df.english.values
crowdtalk = train_df.crowdtalk.values

In [None]:
english

array(['upon this ladder one of them mounted',
       'and solicited at the court of Augustus to be preferred to a greater ship',
       'but how am I sunk!', ..., '“But my toils now drew near a close',
       'going as soon as I was dressed to pay my attendance upon his honour',
       'for there was no sign of any violence except the black mark of fingers on his neck.'],
      dtype=object)

In [None]:
processedLines = [gensim.utils.simple_preprocess(sentence) for sentence in english]
#eng_word_list = [word for words in processedLines for word in words]

eng_word_list = [word[0] for word in processedLines ]  # first word of each sentence only (Bleu = 0.080)  !!!


In [None]:
processedLines = [gensim.utils.simple_preprocess(sentence) for sentence in crowdtalk]
#crowdtalk_word_list = [word for words in processedLines for word in words]

crowdtalk_word_list = [word[0] for word in processedLines]  # first word of each sentence only (Bleu = 0.080)  !!!


In [None]:
dict1 = dict(zip(crowdtalk_word_list, eng_word_list))

# Prediction Phase ✈

In [37]:
crowdtalk = test_df.crowdtalk.values

In [38]:
processedLines = [gensim.utils.simple_preprocess(sentence) for sentence in crowdtalk]

In [None]:
!pip install jellyfish

In [35]:
!pip install gingerit


Collecting gingerit
  Downloading gingerit-0.8.2-py3-none-any.whl (3.3 kB)
Installing collected packages: gingerit
Successfully installed gingerit-0.8.2


In [36]:


from gingerit.gingerit import GingerIt

text = 'according the to license he had me'

parser = GingerIt()
parser.parse(text)

{'corrections': [],
 'result': 'according the to license he had me',
 'text': 'according the to license he had me'}

In [None]:
!pip install -U git+https://github.com/PrithivirajDamodaran/Gramformer.git


In [50]:
!pip install spacy



In [None]:
!python -m spacy download en_core_web_lg # Download the English model, which bundles several pretrained preprocessing pipelines


In [55]:
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()

In [101]:
from tqdm.notebook import tqdm
sentences2=[]  # collects the post-processed translation of each test sentence
bi_words_list=list(bi_words)
followups=[]
for i in tqdm(range(len(processedLines))):
  sentence=processedLines[i]
  translation_tokens=[]
  bi_sent=bigram_transformer[sentence]
  for token in bi_sent:
    if token in translation_dict:
      translation_tokens.append(translation_dict[token])
    # elif token[:-1] in translation_dict and (token[-1]=='s' or token[-1]=='z'):
    #   print("actually here")
    #   translation_tokens.append(translation_dict[token[:-1]]+'s')
    # elif token+'s' in translation_dict:
    #   print("wow also here")
    #   translation_tokens.append(translation_dict[token+'s'][:-1])

  sent_modified=[]
  sent_modified.append(translation_tokens[0])
  for i in range(1,len(translation_tokens)):
    if not translation_tokens[i] == translation_tokens[i-1]:
      sent_modified.append(translation_tokens[i])

  final_sent=' '.join(sent_modified)
  sent_modified=[]
  doc=nlp(final_sent)
  continue_flag=False
  for i,t in enumerate(doc):
    if continue_flag:
      continue_flag=False
      continue
#'PART','ADP','CCONJ'
    if t.text=='the' and i<len(doc)-1 and (doc[i+1].pos_ in['PART','ADP','CCONJ']or doc[i+1].text=='that'):
      sent_modified.append(doc[i+1].text)
      sent_modified.append(t.text)
      continue_flag=True
    else:
      sent_modified.append(t.text)

    # else:
    #   partial_idx=np.argmin([jellyfish.levenshtein_distance(token, b_token) for b_token in bi_words_list])
    #   closest_word=bi_words_list[partial_idx]
    #   translation_tokens.append(translation_dict[closest_word])
  sentences2.append(parser.parse(' '.join(sent_modified))['result'].replace("  ",' '))

  0%|          | 0/3985 [00:00<?, ?it/s]
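
The two clean-up rules in the loop above, collapsing repeated tokens and moving "the" behind a following function word, can be tried in isolation. In this sketch a small hard-coded word set stands in for the spaCy part-of-speech check, and the helper names are hypothetical. The input is one of the low-scoring training translations printed earlier.

```python
from itertools import groupby

# Hypothetical stand-in for the spaCy POS test (PART, ADP, CCONJ, or the word "that").
FUNCTION_WORDS = {"of", "to", "and", "that"}


def collapse_repeats(tokens):
    """Keep one token from every run of identical consecutive tokens."""
    return [key for key, _ in groupby(tokens)]


def fix_the_order(tokens):
    """Rewrite 'the <function word>' as '<function word> the'."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "the" and i + 1 < len(tokens) and tokens[i + 1] in FUNCTION_WORDS:
            out.extend([tokens[i + 1], "the"])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


raw = "at the the centre the the of island is is a fifty fifty yards in diameter".split()
cleaned = " ".join(fix_the_order(collapse_repeats(raw)))
print(cleaned)  # at the centre of the island is a fifty yards in diameter
```

Collapsing repeats is lossy: a sentence that legitimately repeats a word loses one copy.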



In [107]:
len(sentences2[:test_df.shape[0]])

3985

In [106]:
test_df.shape[0]

3985

In [62]:
from collections import Counter
Counter(followups)

Counter({'ADJ': 270,
         'ADP': 705,
         'ADV': 25,
         'AUX': 9,
         'CCONJ': 67,
         'DET': 30,
         'NOUN': 1032,
         'NUM': 6,
         'PART': 14,
         'PRON': 2,
         'PROPN': 22,
         'SCONJ': 2,
         'VERB': 28,
         'X': 2})

In [None]:
import jellyfish


2

Create the sentences by replacing each crowd-talk word with its corresponding English word, using the dictionary mapping built above.



In [None]:
sentences3=[]
for sent in sentences2:
  sentence_split=sent.split()
  sent_modified=[]
  sent_modified.append(sentence_split[0])
  for i in range(1,len(sentence_split)):
    if not sentence_split[i] == sentence_split[i-1]:
      sent_modified.append(sentence_split[i])
    else:
      print("here")
  sentences3.append(" ".join(sent_modified))

In [None]:
sentence = []

for i in processedLines:
  sentence_part = []
  word = ''
  for k, j in enumerate(i):
    if j in dict1:
      word = ''.join(dict1[j])
    else:
      word = ''.join(' ')
    sentence_part.append(word)
    temp = ' '.join(sentence_part)
  sentence.append(temp)

In [108]:
test_df['prediction'] = sentences2[:test_df.shape[0]]  # one prediction per test row

In [None]:
from gingerit.gingerit import GingerIt

parser = GingerIt()
res=parser.parse('and of strange things my of beauty')['result'].replace("  ",' ')

In [76]:
reverse_trans_dict={v:k for k,v in translation_dict.items()}

In [None]:
reverse_trans_dict

In [None]:
for word in english_words:
  if any([True for w in list(reverse_trans_dict.keys()) if word==w+'s']):
    print(word,reverse_trans_dict[word],reverse_trans_dict[word[:-1]])

In [109]:
test_df.prediction

0                   and reported strange things my beauty
1                             scared and crimes, as was I
2                                 when I found on my feet
3                      according to the license he had me
4                     the very worst effects that avarice
                              ...                        
3980                                 when it did not rain
3981                               but she did not answer
3982    when it was found could I neither understand n...
3983                 by which they distinguish themselves
3984                              till could I reach them
Name: prediction, Length: 3985, dtype: object

In [None]:
for s in sentences2:
  if 'I and ' in s:
    print(True)

In [92]:
test_df.to_csv('./translated.csv')

In [111]:
test_df

Unnamed: 0,id,crowdtalk,prediction
0,27226,treuns schleangly throuys praests qeipp cyclui...,and reported strange things my beauty
1,31034,feosch treuns schleangly gliath spluiey gheuck...,"scared and crimes, as was I"
2,35270,scraocs knaedly squiend sriohl clield whaioght...,when I found on my feet
3,23380,sqaups schlioncy yoik gnoirk cziourk schnaunk ...,according to the license he had me
4,92117,schlioncy yoik psycheiancy mcountz pously mcna...,the very worst effects that avarice
...,...,...,...
3980,22854,scraocs knaedly daioc mceab spriaonn schmeips ...,when it did not rain
3981,24201,toirts choolt blointly spriaonn schmeips krous...,but she did not answer
3982,33494,scraocs knaedly daioc mceab sooc kniousts clie...,when it was found could I neither understand n...
3983,28988,czogy stoorty wheians veurg mcmoorth dwiountz ...,by which they distinguish themselves


Save the predictions in the assets directory as submission.csv.

In [112]:
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"), index=False)


# Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S)


In [None]:
%aicrowd notebook submit -c lingua-franca-translation -a assets --no-verify