# Amazon KDD Cup 2023 - Task 3 - Next Product Title Generation

![](https://images.aicrowd.com/raw_images/challenges/banner_file/1118/ba4ead7cfb583ee688ac.jpg)

This notebook contains instructions and an example submission with random predictions.



## Installations 🤖

1. `aicrowd-cli` for downloading challenge data and making submissions
2. `pyarrow` for saving to parquet for submissions

In [1]:
!pip install aicrowd-cli pyarrow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.15-py3-none-any.whl (51 kB)
Collecting click<8,>=7.1.2
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.2-py3-none-any.whl (214 kB)
Collecting python-slugify<6,>=5.0.0
  Downloading pyt

## Login to AIcrowd and download the data 📚

In [None]:
!aicrowd login

In [3]:
!aicrowd dataset download --challenge task-3-next-product-title-generation

sessions_test_task1.csv: 100% 19.4M/19.4M [00:01<00:00, 16.1MB/s]
sessions_test_task2.csv: 100% 1.92M/1.92M [00:00<00:00, 4.36MB/s]
sessions_test_task3.csv: 100% 2.67M/2.67M [00:00<00:00, 5.55MB/s]
products_train.csv: 100% 589M/589M [01:07<00:00, 8.69MB/s]
sessions_train.csv: 100% 259M/259M [00:18<00:00, 13.9MB/s]


## Setup data and task information

In [4]:
import os
import numpy as np
import pandas as pd
from functools import lru_cache

In [5]:
train_data_dir = '.'
test_data_dir = '.'
task = 'task3'
PREDS_PER_SESSION = 100

In [6]:
# Cache loading of data for multiple calls

@lru_cache(maxsize=1)
def read_product_data():
    return pd.read_csv(os.path.join(train_data_dir, 'products_train.csv'))

@lru_cache(maxsize=1)
def read_train_data():
    return pd.read_csv(os.path.join(train_data_dir, 'sessions_train.csv'))

@lru_cache(maxsize=3)
def read_test_data(task):
    return pd.read_csv(os.path.join(test_data_dir, f'sessions_test_{task}.csv'))
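
One consequence of caching the loaders with `lru_cache` is that every call returns the same object, so an in-place change made by one caller is visible to all later callers. The sketch below illustrates this with a plain dictionary standing in for the DataFrame; `load_table` and its contents are hypothetical and not part of the notebook.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_table():
    # Stand-in for an expensive pd.read_csv call (hypothetical data).
    return {"id": ["B000000001", "B000000002"]}

first = load_table()
second = load_table()
same_object = first is second  # the second call is served from the cache

# Work on a copy so the cached object stays unchanged for later callers.
safe = dict(first)
safe["id"] = []
cache_untouched = load_table()["id"] == ["B000000001", "B000000002"]
```

With pandas the equivalent precaution is to call `.copy()` on the returned DataFrame before modifying it in place.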

## Data Description

The Multilingual Shopping Session Dataset is a collection of **anonymized customer sessions** containing products from six different locales, namely English, German, Japanese, French, Italian, and Spanish. It consists of two main components: **user sessions** and **product attributes**. User sessions are a list of products that a user has engaged with in chronological order, while product attributes include various details like product title, price in local currency, brand, color, and description.

---

### Each product has its associated information:


**locale**: the locale code of the product (e.g., DE)

**id**: a unique identifier for the product, also known as the Amazon Standard Identification Number (ASIN) (e.g., B07WSY3MG8)

**title**: title of the item (e.g., “Japanese Aesthetic Sakura Flowers Vaporwave Soft Grunge Gift T-Shirt”)

**price**: price of the item in local currency (e.g., 24.99)

**brand**: item brand name (e.g., “Japanese Aesthetic Flowers & Vaporwave Clothing”)

**color**: color of the item (e.g., “Black”)

**size**: size of the item (e.g., “xxl”)

**model**: model of the item (e.g., “iphone 13”)

**material**: material of the item (e.g., “cotton”)

**author**: author of the item (e.g., “J. K. Rowling”)

**desc**: description of an item’s key features and benefits, called out via bullet points (e.g., “Solid colors: 100% Cotton; Heather Grey: 90% Cotton, 10% Polyester; All Other Heathers …”)


## EDA 💽

In [7]:
def read_locale_data(locale, task):
    products = read_product_data().query(f'locale == "{locale}"')
    sess_train = read_train_data().query(f'locale == "{locale}"')
    sess_test = read_test_data(task).query(f'locale == "{locale}"')
    return products, sess_train, sess_test

def show_locale_info(locale, task):
    products, sess_train, sess_test = read_locale_data(locale, task)

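    # prev_items is read from the CSV as a single string (e.g. "['B07TBRJKX4' 'B003ZG7CMA']"),
    # so len() below counts the characters in that string, not the number of products.
    train_l = sess_train['prev_items'].apply(lambda sess: len(sess))
    test_l = sess_test['prev_items'].apply(lambda sess: len(sess))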

    print(f"Locale: {locale} \n"
          f"Number of products: {products['id'].nunique()} \n"
          f"Number of train sessions: {len(sess_train)} \n"
          f"Train session lengths - "
          f"Mean: {train_l.mean():.2f} | Median {train_l.median():.2f} | "
          f"Min: {train_l.min():.2f} | Max {train_l.max():.2f} \n"
          f"Number of test sessions: {len(sess_test)}"
        )
    if len(sess_test) > 0:
        print(
            f"Test session lengths - "
            f"Mean: {test_l.mean():.2f} | Median {test_l.median():.2f} | "
            f"Min: {test_l.min():.2f} | Max {test_l.max():.2f} \n"
        )
    print("======================================================================== \n")

In [8]:
products = read_product_data()
locale_names = products['locale'].unique()
for locale in locale_names:
    show_locale_info(locale, task)

Locale: DE 
Number of products: 518327 
Number of train sessions: 1111416 
Train session lengths - Mean: 57.89 | Median 40.00 | Min: 27.00 | Max 2060.00 
Number of test sessions: 10000
Test session lengths - Mean: 39.92 | Median 27.00 | Min: 27.00 | Max 581.00 


Locale: JP 
Number of products: 395009 
Number of train sessions: 979119 
Train session lengths - Mean: 59.61 | Median 40.00 | Min: 27.00 | Max 6257.00 
Number of test sessions: 10000
Test session lengths - Mean: 40.23 | Median 27.00 | Min: 27.00 | Max 436.00 


Locale: UK 
Number of products: 500180 
Number of train sessions: 1182181 
Train session lengths - Mean: 54.85 | Median 40.00 | Min: 27.00 | Max 2654.00 
Number of test sessions: 10000
Test session lengths - Mean: 48.85 | Median 40.00 | Min: 27.00 | Max 410.00 


Locale: ES 
Number of products: 42503 
Number of train sessions: 89047 
Train session lengths - Mean: 48.82 | Median 40.00 | Min: 27.00 | Max 792.00 
Number of test sessions: 6421
Test session lengths - Mean: 

In [9]:
products.sample(5)

Unnamed: 0,id,locale,title,price,brand,color,size,model,material,author,desc
72043,B09T66362W,DE,"LED Kabellose Maus, Tragbar 2.4 G Wiederauflad...",11.99,Asnoty,Schwarz,,,,,【Buntes LED-Licht】 7 weiche LED-Farben wechsel...
1372965,B093FB22WT,UK,Jefshon Baby Piano Musical Mats 35 Music Sound...,14.89,Jefshon,Green,,GP5922,Polyester,,[Safe Material and Anti- Slip] : This musical ...
208601,B08HC8VWG4,DE,"Kalorik TKG MW 2500 DG, Mikrowelle, 25 Liter I...",184.89,Kalorik,Cremefarben,,TKG MW 2500 DG,Kunststoff,,Auch mit Grillfunktion und Auftaufunktion
933563,B09TVLRY41,UK,"Ultrasonic Toothbrush for Adults 5 Modes ,Soni...",34.99,OKMIMO,Black,,,,,2 Minutes Smart Timer & Brushing Reminder - Ul...
656334,B08XW7MZNX,JP,„É¨„ÉÉ„ÇØ „Éû„É´„ÉÅ Ê∞¥Âàá„Çä„Åã„Åî („ÉØ„Ç§„Éâ) SIAAÊäóËèå„ÄÅÊµÅ„Çå„Çã/ÊµÅ„Çå„Å™„ÅÑÈÅ∏„Åπ„Çã„Éà„É¨„Éº„ÄÅ„Ç≥„ÉÉ„Éó„Éª...,2909.0,„É¨„ÉÉ„ÇØ(LEC),ÁôΩ,„ÉØ„Ç§„Éâ,K00405,„Çπ„ÉÜ„É≥„É¨„ÇπÈãº,,„Ç∞„É©„Çπ„ÄÅ„Éú„Éà„É´„Çπ„Çø„É≥„Éâ„ÅåÂ§ñÂÅ¥„Å´„ÅÇ„Çã„ÅÆ„Åß„Ç´„Ç¥„ÅÆ‰∏≠„ÇíÂ∫É„ÄÖ‰Ωø„Åà„Åæ„Åô„ÄÇ


In [10]:
train_sessions = read_train_data()
train_sessions.sample(5)

Unnamed: 0,prev_items,next_item,locale
1646342,['B017SFJ8WK' 'B07ZCW5KD9' 'B07ZCVZWP3' 'B08ZV...,B08GKJ21Y5,JP
343344,['B08QF9F8BC' 'B098DQ87NT' 'B08QF9F8BC'],B08SGX86X7,DE
2627480,['B0B3LRD151' 'B08DM2YN8G' 'B083HRZBDT' 'B08DM...,B0B7JZVCH4,UK
2742311,['B07TBRJKX4' 'B003ZG7CMA'],B00JQ2AJHC,UK
3138973,['B0191BQXK4' 'B005UXMZK0'],B018A6U5SW,UK
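
As the sample shows, `prev_items` is stored as one string per session (a printed NumPy array), so calling `len()` on it counts characters. This is why the minimum "session length" reported earlier is 27: two 10-character IDs plus quotes, a space, and brackets. A minimal sketch for recovering the item list is given here; the helper name `parse_prev_items` is hypothetical, and the regular expression assumes IDs never contain a single quote.

```python
import re

def parse_prev_items(raw):
    """Extract product IDs from a string like "['B08QF9F8BC' 'B098DQ87NT']"."""
    return re.findall(r"'([^']+)'", raw)

# Row taken from the training sample shown above.
example = "['B08QF9F8BC' 'B098DQ87NT' 'B08QF9F8BC']"
items = parse_prev_items(example)

num_chars = len(example)  # 40 characters in the raw string
num_items = len(items)    # 3 products in the session
```

Applied to a DataFrame, something like `sess_train['prev_items'].apply(lambda s: len(parse_prev_items(s)))` would give the number of products per session instead of the string length.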


In [11]:
test_sessions = read_test_data(task)
test_sessions.sample(5)

Unnamed: 0,prev_items,locale
44027,['B09HXCNVQ9' 'B09NKQPFKX' 'B09NKQT9GY' 'B09NK...,JP
29760,['B07QVLL68D' 'B08B3WW4SG' 'B093CC8N7X'],IT
41208,['B083R1QYQD' 'B083R1RQLR' 'B083R1QYQD' 'B083R...,JP
6862,['B09J95311C' 'B07PP343KJ'],DE
9402,['B075LFT858' 'B075LMPFB3'],DE


## Generate Submission 🏋️‍♀️



Submission format:
1. The submission should be a **parquet** file with the sessions from all the locales.  
2. Predictions should be added in a new column named **"next_item_prediction"**.
3. Predictions should be a single string, the next product title for the session.

In [12]:
def random_predictions(locale, sess_test_locale):
    # Fixed seed so the random baseline is reproducible
    random_state = np.random.RandomState(42)
    products = read_product_data().query(f'locale == "{locale}"')
    # Draw one random product title from the same locale for every test session
    predictions = (products['title']
                   .sample(len(sess_test_locale), replace=True, random_state=random_state)
                   .values
    )
    # Note: this modifies the DataFrame passed in, so callers should pass a copy
    sess_test_locale['next_item_prediction'] = predictions
    sess_test_locale.drop('prev_items', inplace=True, axis=1)
    return sess_test_locale

In [13]:
test_sessions = read_test_data(task)
predictions = []
test_locale_names = test_sessions['locale'].unique()
for locale in test_locale_names:
    sess_test_locale = test_sessions.query(f'locale == "{locale}"').copy()
    predictions.append(
        random_predictions(locale, sess_test_locale)
    )
predictions = pd.concat(predictions).reset_index(drop=True)
predictions.sample(5)

Unnamed: 0,locale,next_item_prediction
45091,JP,"„Ç™„Ç¶„É´„ÉÜ„ÉÉ„ÇØ Ë∂Ö„Çø„Éï „É©„Ç§„Éà„Éã„É≥„Ç∞„Ç±„Éº„Éñ„É´ ËÄêÂ±àÊõ≤50,000Âõû AppleË™çË®º iPhon..."
52522,UK,"320ml Hot Water Bottle with Knited Cover, Mini..."
2614,ES,kwmobile Carcasa Compatible con Samsung Galaxy...
33688,IT,Caffè Borbone Miscela Decaffeinata Cialda Comp...
16942,FR,"JETech Coque Ultra Fine (0,35 mm) pour iPhone ..."


## Validate predictions ✅

In [14]:
def check_predictions(predictions):
    """
    These tests need to pass, as they will also be applied by the evaluator
    """
    test_locale_names = test_sessions['locale'].unique()
    for locale in test_locale_names:
        sess_test = test_sessions.query(f'locale == "{locale}"')
        preds_locale = predictions[predictions['locale'] == sess_test['locale'].iloc[0]]
        assert sorted(preds_locale.index.values) == sorted(sess_test.index.values), f"Session ids of {locale} don't match"
    # The type check covers the whole column, so it only needs to run once
    assert predictions['next_item_prediction'].apply(lambda x: isinstance(x, str)).all(), "Predictions should all be strings"

In [15]:
check_predictions(predictions)

In [16]:
# It's important that the parquet file you submit is saved with the pyarrow backend
predictions.to_parquet(f'submission_{task}.parquet', engine='pyarrow')

## Submit to AIcrowd 🚀

In [None]:
# You can submit with aicrowd-cli, or upload manually on the challenge page.
!aicrowd submission create -c task-3-next-product-title-generation -f "submission_task3.parquet"