{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"getting-started-code-nlp-feature-engineering.ipynb","provenance":[],"collapsed_sections":[]},"kernelspec":{"display_name":"Python [conda env:ml]","language":"python","name":"conda-env-ml-py"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"UKeGXGskEBEW"},"source":["# Solution for NLP Feature Engineering LB: 0.803\n","\n","This solution uses a count vectorizer, a TF-IDF transform, and a stopword filter for feature engineering."]},{"cell_type":"markdown","metadata":{"id":"sDLGIamsZI-Q"},"source":["## AIcrowd Runtime Configuration 🧷\n","\n","Define configuration parameters. Please include any files needed for the notebook to run under `ASSETS_DIR`. We will copy the contents of this directory to your final submission file 🙂\n","\n","The dataset is available under `/data` on the workspace."]},{"cell_type":"code","metadata":{"id":"XcY_MMl_ZLur","executionInfo":{"status":"ok","timestamp":1624453552775,"user_tz":-120,"elapsed":5,"user":{"displayName":"yunlai li","photoUrl":"","userId":"11696035370877079017"}}},"source":["import os\n","\n","# Please use the absolute path for the location of the dataset.\n","# Or you can use a relative path with `os.getcwd() + \"/test_data/test.csv\"`\n","AICROWD_DATASET_PATH = os.getenv(\"DATASET_PATH\", os.getcwd()+\"/data/data.csv\")\n","AICROWD_OUTPUTS_PATH = os.getenv(\"OUTPUTS_DIR\", \"\")\n","AICROWD_ASSETS_DIR = os.getenv(\"ASSETS_DIR\", \"assets\")"],"execution_count":1,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"k3bHV9SL5tnX"},"source":["# Install packages 🗃\n","\n","We are going to use sklearn to do Count Vectorization and TF 
IDF."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"IXBdVazs4s09","executionInfo":{"status":"ok","timestamp":1624453572328,"user_tz":-120,"elapsed":19558,"user":{"displayName":"yunlai li","photoUrl":"","userId":"11696035370877079017"}},"outputId":"c1962951-0089-4abf-d72e-a9a601331a9c"},"source":["!pip install --upgrade scikit-learn gensim\n","!pip install -q -U aicrowd-cli"],"execution_count":2,"outputs":[{"output_type":"stream","text":["Collecting scikit-learn\n","\u001b[?25l  Downloading https://files.pythonhosted.org/packages/a8/eb/a48f25c967526b66d5f1fa7a984594f0bf0a5afafa94a8c4dbc317744620/scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3MB)\n","\u001b[K     |████████████████████████████████| 22.3MB 61.0MB/s \n","\u001b[?25hCollecting gensim\n","\u001b[?25l  Downloading https://files.pythonhosted.org/packages/44/52/f1417772965652d4ca6f901515debcd9d6c5430969e8c02ee7737e6de61c/gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9MB)\n","\u001b[K     |████████████████████████████████| 23.9MB 39.2MB/s \n","\u001b[?25hRequirement already satisfied, skipping upgrade: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)\n","Collecting threadpoolctl>=2.0.0\n","  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl\n","Requirement already satisfied, skipping upgrade: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)\n","Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.19.5)\n","Requirement already satisfied, skipping upgrade: smart-open>=1.8.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (5.1.0)\n","Installing collected packages: threadpoolctl, scikit-learn, gensim\n","  Found existing installation: scikit-learn 0.22.2.post1\n","    Uninstalling 
scikit-learn-0.22.2.post1:\n","      Successfully uninstalled scikit-learn-0.22.2.post1\n","  Found existing installation: gensim 3.6.0\n","    Uninstalling gensim-3.6.0:\n","      Successfully uninstalled gensim-3.6.0\n","Successfully installed gensim-4.0.1 scikit-learn-0.24.2 threadpoolctl-2.1.0\n","\u001b[K     |████████████████████████████████| 51kB 3.2MB/s \n","\u001b[K     |████████████████████████████████| 81kB 6.3MB/s \n","\u001b[K     |████████████████████████████████| 215kB 12.6MB/s \n","\u001b[K     |████████████████████████████████| 174kB 12.7MB/s \n","\u001b[K     |████████████████████████████████| 61kB 7.5MB/s \n","\u001b[K     |████████████████████████████████| 61kB 6.8MB/s \n","\u001b[K     |████████████████████████████████| 51kB 6.3MB/s \n","\u001b[K     |████████████████████████████████| 71kB 8.5MB/s \n","\u001b[31mERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.\u001b[0m\n","\u001b[31mERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.\u001b[0m\n","\u001b[?25h"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"97b5nvSbZQvk"},"source":["# Define preprocessing code 💻\n","\n","The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. 
Please make sure to add any common logic between the training and prediction sections here."]},{"cell_type":"code","metadata":{"id":"YjhfHq-FWozF","executionInfo":{"status":"ok","timestamp":1624453572969,"user_tz":-120,"elapsed":643,"user":{"displayName":"yunlai li","photoUrl":"","userId":"11696035370877079017"}}},"source":["from glob import glob\n","import os\n","import pandas as pd\n","import numpy as np\n","from sklearn import model_selection\n","from sklearn.tree import DecisionTreeClassifier\n","from sklearn.model_selection import train_test_split\n","from sklearn.metrics import f1_score, accuracy_score\n","import sklearn"],"execution_count":3,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"sWdct35aZTA-"},"source":["# Training phase ⚙️\n","\n","You can define your training code here. This section will be skipped during evaluation.\n","\n","For this solution approach there is no training needed! 🙂"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"eXUYopYXDHAv","executionInfo":{"status":"ok","timestamp":1624453573848,"user_tz":-120,"elapsed":881,"user":{"displayName":"yunlai li","photoUrl":"","userId":"11696035370877079017"}},"outputId":"8aaeda49-da06-4417-8432-4b6e7338e4de"},"source":["API_KEY = '' # Please get your API Key from [https://www.aicrowd.com/participants/me]\n","!aicrowd login --api-key $API_KEY"],"execution_count":4,"outputs":[{"output_type":"stream","text":["\u001b[32mAPI Key valid\u001b[0m\n","\u001b[32mSaved API Key successfully!\u001b[0m\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"ItXbQk8hL6ke","executionInfo":{"status":"ok","timestamp":1624453578614,"user_tz":-120,"elapsed":4768,"user":{"displayName":"yunlai li","photoUrl":"","userId":"11696035370877079017"}},"outputId":"411ddd16-e344-47bc-b39c-4f7f65e32a19"},"source":["# Downloading the Dataset\n","!mkdir data\n","!aicrowd dataset download --challenge nlp-feature-engineering -j 
3 -o data"],"execution_count":5,"outputs":[{"output_type":"stream","text":["\rdata.csv:   0% 0.00/110k [00:00<?, ?B/s]\rdata.csv: 100% 110k/110k [00:00<00:00, 1.88MB/s]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"AEX46cGwZyuI"},"source":["# Prediction phase 🔎\n","\n","Generating the features in test dataset. "]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":363},"id":"ocahSMTyNkg9","executionInfo":{"status":"ok","timestamp":1624453578615,"user_tz":-120,"elapsed":7,"user":{"displayName":"yunlai li","photoUrl":"","userId":"11696035370877079017"}},"outputId":"a8aa4120-b519-47ef-ca51-9fdf6c7d93be"},"source":["test_dataset = pd.read_csv(AICROWD_DATASET_PATH)\n","test_dataset"],"execution_count":6,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>id</th>\n","      <th>text</th>\n","      <th>feature</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0</td>\n","      <td>Zero-divisors (ZDs) derived by Cayley-Dickson ...</td>\n","      <td>[0.3745401188473625, 0.9507143064099162, 0.731...</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>1</td>\n","      <td>This paper is an exposition of the so-called i...</td>\n","      <td>[0.9327284833540133, 0.8660638895004084, 0.045...</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>2</td>\n","      <td>Zero-divisors (ZDs) derived by Cayley-Dickson ...</td>\n","      <td>[0.9442664891134339, 0.47421421665746377, 0.86...</td>\n","    </tr>\n","    <tr>\n","      
<th>3</th>\n","      <td>3</td>\n","      <td>We calculate the equation of state of dense hy...</td>\n","      <td>[0.18114934953468032, 0.6811178539649828, 0.18...</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>4</td>\n","      <td>The Donald-Flanigan conjecture asserts that fo...</td>\n","      <td>[0.5435382173426461, 0.08172534574677826, 0.45...</td>\n","    </tr>\n","    <tr>\n","      <th>5</th>\n","      <td>5</td>\n","      <td>Let $E$ be a primarily quasilocal field, $M/E$...</td>\n","      <td>[0.7945155444907487, 0.7070864772666982, 0.050...</td>\n","    </tr>\n","    <tr>\n","      <th>6</th>\n","      <td>6</td>\n","      <td>The paper deals with the study of labor market...</td>\n","      <td>[0.3129073942136482, 0.27109625376406576, 0.59...</td>\n","    </tr>\n","    <tr>\n","      <th>7</th>\n","      <td>7</td>\n","      <td>Axisymmetric equilibria with incompressible fl...</td>\n","      <td>[0.40680480095172356, 0.3282331056783394, 0.45...</td>\n","    </tr>\n","    <tr>\n","      <th>8</th>\n","      <td>8</td>\n","      <td>This paper analyses the possibilities of perfo...</td>\n","      <td>[0.013682414760681105, 0.08159872000483837, 0....</td>\n","    </tr>\n","    <tr>\n","      <th>9</th>\n","      <td>9</td>\n","      <td>I show that an (n+2)-dimensional n-Lie algebra...</td>\n","      <td>[0.9562918815133613, 0.37667644042946247, 0.33...</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["   id  ...                                            feature\n","0   0  ...  [0.3745401188473625, 0.9507143064099162, 0.731...\n","1   1  ...  [0.9327284833540133, 0.8660638895004084, 0.045...\n","2   2  ...  [0.9442664891134339, 0.47421421665746377, 0.86...\n","3   3  ...  [0.18114934953468032, 0.6811178539649828, 0.18...\n","4   4  ...  [0.5435382173426461, 0.08172534574677826, 0.45...\n","5   5  ...  [0.7945155444907487, 0.7070864772666982, 0.050...\n","6   6  ...  
[0.3129073942136482, 0.27109625376406576, 0.59...\n","7   7  ...  [0.40680480095172356, 0.3282331056783394, 0.45...\n","8   8  ...  [0.013682414760681105, 0.08159872000483837, 0....\n","9   9  ...  [0.9562918815133613, 0.37667644042946247, 0.33...\n","\n","[10 rows x 3 columns]"]},"metadata":{"tags":[]},"execution_count":6}]},{"cell_type":"markdown","metadata":{"id":"9yef7CHWE-yG"},"source":["### Count Vectorizer 🔢\n","A count vectorizer represents each text as a vector of word counts; stacked together, these vectors form a matrix with one row per text. <br> \n","Its vocabulary includes every word that is present in the data.\n","To convert a text into a vector, it counts how often each vocabulary word occurs in that text. \n","<br>\n","For example, if the first position of the vector corresponds to the word \"hello\" and \"hello\" occurs 2 times in the text, then the number 2 will be in this position.\n","<br>\n","The advantage of this method is that the vector always has the same size, regardless of the length of the input text.\n","\n","### TF-IDF 📐\n","Here is an in-depth explanation: <br>\n","https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a \n","<br>\n","\n","### Stopwords\n","Stopwords are very common words that carry little meaning. 
Removing them should improve the compression of the text data into smaller vectors.\n","\n","Note: the submitted features need to be integers; otherwise the evaluation will fail."]},{"cell_type":"code","metadata":{"id":"iNAkGqKBv-go","colab":{"base_uri":"https://localhost:8080/","height":420},"executionInfo":{"status":"ok","timestamp":1624453579056,"user_tz":-120,"elapsed":445,"user":{"displayName":"yunlai li","photoUrl":"","userId":"11696035370877079017"}},"outputId":"a9f8b8cf-c9fa-445e-888b-3ce6a1896b72"},"source":["from gensim.parsing.preprocessing import remove_stopwords\n","from sklearn.feature_extraction.text import CountVectorizer\n","# Remove stopwords, then count words using a vocabulary capped at the 512 most frequent terms\n","count_vect = CountVectorizer(max_features=512)\n","X_train_counts = count_vect.fit_transform([remove_stopwords(i) for i in test_dataset.text.tolist()])\n","\n","from sklearn.feature_extraction.text import TfidfTransformer\n","# Reweight the raw counts with TF-IDF\n","tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)\n","X_train_tf = tf_transformer.transform(X_train_counts)\n","X_train_tf = np.round(X_train_tf.toarray()*5).astype(int) # Scale by 5 and round to integers; a factor of 5 worked better than 100\n","\n","# Store each feature vector as a string in the feature column\n","test_dataset.feature = [str(i) for i in X_train_tf.tolist()]\n","test_dataset"],"execution_count":7,"outputs":[{"output_type":"stream","text":["/usr/local/lib/python3.7/dist-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. 
`pip install python-Levenshtein`) to suppress this warning.\n","  warnings.warn(msg)\n"],"name":"stderr"},{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>id</th>\n","      <th>text</th>\n","      <th>feature</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0</td>\n","      <td>Zero-divisors (ZDs) derived by Cayley-Dickson ...</td>\n","      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>1</td>\n","      <td>This paper is an exposition of the so-called i...</td>\n","      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>2</td>\n","      <td>Zero-divisors (ZDs) derived by Cayley-Dickson ...</td>\n","      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>3</td>\n","      <td>We calculate the equation of state of dense hy...</td>\n","      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>4</td>\n","      <td>The Donald-Flanigan conjecture asserts that fo...</td>\n","      <td>[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, ...</td>\n","    </tr>\n","    <tr>\n","      <th>5</th>\n","      <td>5</td>\n","      <td>Let $E$ be a primarily quasilocal field, $M/E$...</td>\n","      <td>[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n","    </tr>\n","    <tr>\n","      <th>6</th>\n","      <td>6</td>\n","      <td>The 
paper deals with the study of labor market...</td>\n","      <td>[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...</td>\n","    </tr>\n","    <tr>\n","      <th>7</th>\n","      <td>7</td>\n","      <td>Axisymmetric equilibria with incompressible fl...</td>\n","      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n","    </tr>\n","    <tr>\n","      <th>8</th>\n","      <td>8</td>\n","      <td>This paper analyses the possibilities of perfo...</td>\n","      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, ...</td>\n","    </tr>\n","    <tr>\n","      <th>9</th>\n","      <td>9</td>\n","      <td>I show that an (n+2)-dimensional n-Lie algebra...</td>\n","      <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, ...</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["   id  ...                                            feature\n","0   0  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n","1   1  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...\n","2   2  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n","3   3  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n","4   4  ...  [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, ...\n","5   5  ...  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n","6   6  ...  [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...\n","7   7  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...\n","8   8  ...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, ...\n","9   9  ...  
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, ...\n","\n","[10 rows x 3 columns]"]},"metadata":{"tags":[]},"execution_count":7}]},{"cell_type":"code","metadata":{"id":"R07_U_YFwC9C","executionInfo":{"status":"ok","timestamp":1624453579056,"user_tz":-120,"elapsed":3,"user":{"displayName":"yunlai li","photoUrl":"","userId":"11696035370877079017"}}},"source":["# Saving the submission file\n","test_dataset.to_csv(os.path.join(AICROWD_OUTPUTS_PATH,'submission.csv'), index=False)"],"execution_count":8,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"tx6nuUzrwnC6"},"source":["# Submit to AIcrowd 🚀\n","\n","**Note: Please save the notebook before submitting it (Ctrl + S)**"]},{"cell_type":"code","metadata":{"id":"X95c_97SwqP3"},"source":["!DATASET_PATH=$AICROWD_DATASET_PATH \\\n","aicrowd -v notebook submit \\\n","    --assets-dir $AICROWD_ASSETS_DIR \\\n","    --challenge nlp-feature-engineering"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"t1DM5rWjDMZH"},"source":["## Congratulations 🎉!\n","Now you have an understanding of how to do simple feature engineering in NLP.\n","<br>\n","If you liked it, please leave a like.\n","<br><br>\n","PS: This notebook is adapted from the getting-started notebook by \n","Shubhamaicrowd.\n"]},{"cell_type":"code","metadata":{"id":"rHpS9kdKaOEq","executionInfo":{"status":"ok","timestamp":1624453614744,"user_tz":-120,"elapsed":8,"user":{"displayName":"yunlai li","photoUrl":"","userId":"11696035370877079017"}}},"source":[""],"execution_count":9,"outputs":[]}]}