{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "e235bb95-5618-4b70-ae07-6bb4f855922a", "metadata": {}, "outputs": [], "source": [ "# !pip install contractions" ] }, { "cell_type": "code", "execution_count": 2, "id": "6b117c76-f3ef-429e-b971-9d820679320d", "metadata": {}, "outputs": [], "source": [ "import nltk\n", "from nltk.corpus import PlaintextCorpusReader\n", "from nltk.corpus import stopwords\n", "from nltk.stem.porter import *\n", "from nltk import pos_tag, word_tokenize\n", "from nltk.stem import WordNetLemmatizer\n", "from nltk.probability import FreqDist\n", "from nltk.tokenize import sent_tokenize\n", "from nltk.tokenize import word_tokenize\n", "import contractions\n", "\n", "import gensim\n", "from gensim import corpora\n", "from gensim import similarities\n", "from gensim import models\n", "from gensim.models import CoherenceModel\n", "\n", "# from wordcloud import WordCloud, ImageColorGenerator\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import pandas as pd\n", "import re\n", "import os\n", "import glob\n", "import json\n", "\n", "import psycopg2\n", "import pickle\n", "from datetime import datetime\n", "import datetime" ] }, { "cell_type": "markdown", "id": "228f935e-fb42-4ffb-bedc-a5a50dbd0bd9", "metadata": {}, "source": [ "# Import Data" ] }, { "cell_type": "code", "execution_count": 4, "id": "cc0cac0a-2a0b-4586-b970-ef3c6f3e85fe", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"cleaned_data.csv\")" ] }, { "cell_type": "markdown", "id": "fcc0d9ad-116c-4e87-b349-89b0118238cb", "metadata": {}, "source": [ "# EDA" ] }, { "cell_type": "code", "execution_count": null, "id": "c76090be", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "476ff0c9-e23d-4cb5-8147-e764df46d2c8", "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "dbde7785-b2dc-4bfe-8740-19149f676b9e", "metadata": {}, "outputs": [], "source": [ "df_copy = df.copy()" ] }, { "cell_type": "code", "execution_count": null, "id": "543d8eff-5f20-43c2-81c2-0ca5ac75a10e", "metadata": {}, "outputs": [], "source": [ "df_copy.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "48d0221f-b42e-4776-ac50-9a201a9f90c3", "metadata": {}, "outputs": [], "source": [ "df_copy.isnull().sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "a9e0e53d-bd4a-4227-a36b-d22fb4064b56", "metadata": {}, "outputs": [], "source": [ "df_copy.dropna(subset=[\"Headline_Details\"], inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "8b7f4b70-8205-4d42-a9b8-986de57fdb45", "metadata": {}, "outputs": [], "source": [ "print(\"Published Date Statistics:\")\n", "print(\"Min Date:\", df_copy[\"Datetime\"].min())\n", "print(\"Max Date:\", df_copy[\"Datetime\"].max())" ] }, { "cell_type": "code", "execution_count": null, "id": "97988848-03f7-4b1a-b2c5-eb8cf9ff3dfa", "metadata": {}, "outputs": [], "source": [ "# Check if there are any duplicated titles since a news can be published for multiple times by different publisher at different time\n", "df_copy[[\"Year\", \"Headline_Details\", \"Region\"]].duplicated().any()" ] }, { "cell_type": "code", "execution_count": null, "id": "af3e758a-4114-4273-99e8-00e717b7071c", "metadata": {}, "outputs": [], "source": [ "# drop the duplicated news\n", "duplicates = df_copy.duplicated(\n", " subset=[\"Year\", \"Headline_Details\", \"Region\"], keep=\"first\"\n", ")\n", "df_uni = 
df_copy[~duplicates]" ] }, { "cell_type": "code", "execution_count": null, "id": "2a23a150-c8a2-421f-8096-5070f3d8747a", "metadata": {}, "outputs": [], "source": [ "df_uni.shape" ] }, { "cell_type": "markdown", "id": "4c9a6dbf-a74c-4bc7-a78e-619a9212cccc", "metadata": {}, "source": [ "# Text Preprocessing\n", "contractions -> punctuation removal -> lowercase -> -> lemmanisation -> stop words removal + bigram" ] }, { "cell_type": "code", "execution_count": null, "id": "8cf125a0-1444-43cc-9b84-4c0ed6c4bc02", "metadata": {}, "outputs": [], "source": [ "df_uni[\"Headline_Details\"][5]" ] }, { "cell_type": "code", "execution_count": null, "id": "5481f424-321b-467d-8327-db25c32f1bd3", "metadata": {}, "outputs": [], "source": [ "## remove contractions, lowercase, remove numbers and punctuations, remove stopwords\n", "# run time roughly 2 mins\n", "df_uni[\"cleaned_Headline_Details\"] = df_uni[\"Headline_Details\"].apply(\n", " lambda x: [contractions.fix(word) for word in x.split()]\n", ")\n", "\n", "## convert back into string so that tokenization can be done\n", "df_uni[\"cleaned_Headline_Details\"] = [\n", " \" \".join(map(str, l)) for l in df_uni[\"cleaned_Headline_Details\"]\n", "]" ] }, { "cell_type": "code", "execution_count": null, "id": "b761cac7-f544-40d4-bea0-a39b4f994083", "metadata": {}, "outputs": [], "source": [ "df_uni[\"cleaned_Headline_Details\"][5]" ] }, { "cell_type": "markdown", "id": "e92e3b8b-cda4-425a-ab4f-eaeb6e500379", "metadata": {}, "source": [ "### Stemming / Lemmatization - To normalize text and prepare words.\n", "\n", "https://towardsdatascience.com/stemming-vs-lemmatization-in-nlp-dea008600a0#:~:text=Stemming%20and%20Lemmatization%20are%20methods,be%20used%20in%20similar%20contexts.\n", "\n", "Decided to use lemmatization because lemmatization provides better results by performing an analysis that depends on the word’s part-of-speech and producing real, dictionary words. As a result, lemmatization is harder to implement and slower compared to stemming.\n", "\n", "To sum up, lemmatization is almost always a better choice from a qualitative point of view. With today’s computational resources, running lemmatization algorithms shouldn’t have a significant impact on the overall performance. However, if we are heavily optimizing for speed, a simpler stemming algorithm can be a possibility." ] }, { "cell_type": "markdown", "id": "cccdd491-55cb-42e9-9e08-3ab01253b2d0", "metadata": {}, "source": [ "POS taggin + lemming for better lemming performance. However, the lemmatizer requires the correct POS tag to be accurate, \n", "if you use the default settings of the WordNetLemmatizer.lemmatize(), the default tag is noun.\n", "\n", "https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39 " ] }, { "cell_type": "raw", "id": "c25de86c-3bfd-4999-9849-bf41f3a0d167", "metadata": {}, "source": [ "stop_words = set(stopwords.words('english'))\n", "stemmer = PorterStemmer()\n", "\n", "def preprocess(review):\n", " review = \" \".join([stemmer.stem(w.lower()) for w in word_tokenize(re.sub('[^a-zA-Z]+', ' ', review.replace(\"
\", \"\"))) if not w in stop_words])\n", " return review\n", "\n", "# as a result, it stores a normalised text sentences (string)\n", "data['review_clean'] = data.apply(lambda x: preprocess(x['review']), axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "c02dbc2a-9583-4122-a814-a8b723bbbafd", "metadata": {}, "outputs": [], "source": [ "# ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'\n", "# keep only ADJ, ADV, NOUN and VERB.\n", "\n", "wnl = WordNetLemmatizer()\n", "\n", "\n", "def lemmatize_words(text):\n", " # Tokenize the text into sentences and then words\n", " sentences = sent_tokenize(text)\n", " words = [word_tokenize(sentence) for sentence in sentences]\n", "\n", " # Remove punctuation and tokenize into lowercase words\n", " punc = [[w.lower() for w in word if re.search(\"^[a-zA-Z]+$\", w)] for word in words]\n", "\n", " # Perform lemmatization on words with valid POS tags\n", " doc_lemmed = [\n", " wnl.lemmatize(word, pos[0].lower())\n", " for sentence in punc\n", " for word, pos in pos_tag(sentence, tagset=\"universal\")\n", " if pos[0].lower() in [\"a\", \"s\", \"r\", \"n\", \"v\"]\n", " ]\n", "\n", " return doc_lemmed" ] }, { "cell_type": "code", "execution_count": null, "id": "d75603cb-219e-4e4a-83db-023fe1226e04", "metadata": {}, "outputs": [], "source": [ "print(datetime.datetime.now())" ] }, { "cell_type": "code", "execution_count": null, "id": "c4e38b71-8b47-4482-8439-f5142d3229dc", "metadata": {}, "outputs": [], "source": [ "df_uni[\"cleaned_Headline_Details\"] = df_uni[\"cleaned_Headline_Details\"].apply(\n", " lemmatize_words\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "7f7d3945-2e52-401c-8e67-16e15e8e834e", "metadata": {}, "outputs": [], "source": [ "print(datetime.datetime.now())" ] }, { "cell_type": "markdown", "id": "49702368-06a3-4965-9846-778e350254d4", "metadata": {}, "source": [ "### N-gram + Stopword removal" ] }, { "cell_type": "code", "execution_count": null, "id": "979eb98f-8ff5-4b63-8787-70bea21f843b", "metadata": {}, "outputs": [], "source": [ "stop_list = nltk.corpus.stopwords.words(\"english\")\n", "stop_list += [\"local\", \"time\", \"wednesday\", \"source\", \"certain\", \"report\", \"update\"]\n", "\n", "\n", "def corpus2docs2(corpus):\n", " # corpus is a object returned by load_corpus that represents a corpus.\n", " docs = []\n", " for text in corpus:\n", " cleaned = [w for w in text if w not in stop_list]\n", " doc_pos = nltk.pos_tag(cleaned)\n", " phrases = []\n", " i = 0\n", " while i < len(doc_pos):\n", " if doc_pos[i][1] == \"JJ\":\n", " if (\n", " i + 2 < len(doc_pos)\n", " and doc_pos[i + 1][1] == \"NN\"\n", " and doc_pos[i + 2][1] == \"NN\"\n", " ):\n", " phrases.append(\n", " (doc_pos[i][0], doc_pos[i + 1][0], doc_pos[i + 2][0])\n", " )\n", " i += 3\n", " elif i + 1 < len(doc_pos) and doc_pos[i + 1][1] == \"NN\":\n", " phrases.append((doc_pos[i][0], doc_pos[i + 1][0]))\n", " i += 2\n", " else:\n", " i += 1\n", " elif doc_pos[i][1] == \"NN\":\n", " if (\n", " i + 2 < len(doc_pos)\n", " and doc_pos[i + 1][1] == \"NN\"\n", " and doc_pos[i + 2][1] == \"NN\"\n", " ):\n", " phrases.append(\n", " (doc_pos[i][0], doc_pos[i + 1][0], doc_pos[i + 2][0])\n", " )\n", " i += 3\n", " elif i + 1 < len(doc_pos) and doc_pos[i + 1][1] == \"NN\":\n", " phrases.append((doc_pos[i][0], doc_pos[i + 1][0]))\n", " i += 2\n", " else:\n", " i += 1\n", " else:\n", " i += 1\n", " phrase_set = [\"_\".join(word_set) for word_set in phrases]\n", " docs.append(phrase_set)\n", " return docs" ] }, { "cell_type": "code", 
"execution_count": null, "id": "c0a72831-6987-41a2-9be4-297f5d049d91", "metadata": {}, "outputs": [], "source": [ "print(stop_list)" ] }, { "cell_type": "code", "execution_count": null, "id": "d5c48c2c-5fd7-4216-a699-e69893e2aee8", "metadata": {}, "outputs": [], "source": [ "df_uni[\"binary_Headline_Details\"] = corpus2docs2(df_uni[\"cleaned_Headline_Details\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "812d9fe2-89ef-4bbc-901e-a4d1fd8e2eb0", "metadata": {}, "outputs": [], "source": [ "df_uni[\"binary_Headline_Details\"][5]" ] }, { "cell_type": "code", "execution_count": null, "id": "faeb3af3-0a05-46be-98fa-d8a88b075049", "metadata": {}, "outputs": [], "source": [ "fdist_doc = nltk.FreqDist(df_uni[\"binary_Headline_Details\"][5]).most_common(25)\n", "\n", "x, y = zip(*fdist_doc)\n", "plt.figure(figsize=(50, 30))\n", "plt.margins(0.02)\n", "plt.bar(x, y)\n", "plt.xlabel(\"Words\", fontsize=50)\n", "plt.ylabel(\"Frequency of Words\", fontsize=50)\n", "plt.yticks(fontsize=40)\n", "plt.xticks(rotation=60, fontsize=40)\n", "plt.title(\"Frequency of 25 Most Common Words for One Random News\", fontsize=60)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "052a4218-a73e-4df6-8b16-45e58e9da58a", "metadata": {}, "outputs": [], "source": [ "all_words = [word for sublist in df_uni[\"binary_Headline_Details\"] for word in sublist]\n", "all_words[:2]\n", "# Calculate word frequencies\n", "fdist = FreqDist(all_words)" ] }, { "cell_type": "code", "execution_count": null, "id": "fd07bf53-a547-43dc-9dce-070d7ac2dd4c", "metadata": {}, "outputs": [], "source": [ "# Plot the word frequency distribution as a bar graph\n", "plt.figure(figsize=(12, 6))\n", "plt.title(\"Frequency of 25 Most Common Words of the Dataset\", fontsize=12)\n", "fdist.plot(30, cumulative=False)" ] }, { "cell_type": "markdown", "id": "c032514e-2516-41fa-9efb-0eb0197c0fc2", "metadata": {}, "source": [ "# Wordcloud" ] }, { "cell_type": "code", "execution_count": null, "id": "81edd7d1-eb00-430a-9b57-a5ff39737982", "metadata": {}, "outputs": [], "source": [ "com = df_uni[\"Severity\"].unique()\n", "com[:10]" ] }, { "cell_type": "code", "execution_count": null, "id": "0099a39e-dbb4-41d5-afde-a2b135cb5866", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from wordcloud import WordCloud\n", "\n", "# Plotting with Seaborn for each company\n", "for company in com[:10]:\n", " haha = df_uni[\"binary_Headline_Details\"].loc[df_uni.Severity == company]\n", " text = \" \".join(\" \".join(item) for item in haha)\n", " wordcloud = WordCloud(background_color=\"white\").generate(text)\n", " plt.imshow(wordcloud, interpolation=\"bilinear\")\n", " plt.title(f\"Wordcloud for {company}\")\n", " plt.axis(\"off\")\n", " plt.margins(x=0, y=0)\n", " plt.show()" ] }, { "cell_type": "markdown", "id": "1658838b-bb13-4471-998d-1013ec28da3d", "metadata": {}, "source": [ "## IT-IDF Word Removal\n", "\n", "remove those frequently appeared but less important words like say, will, year, use, etc." 
] }, { "cell_type": "code", "execution_count": null, "id": "da95670f-1123-4cf4-80f9-e2ec3ca6041c", "metadata": {}, "outputs": [], "source": [ "df_uni[\"binary_Headline_Details\"] = df_uni[\"binary_Headline_Details\"].apply(\n", " lambda x: \" \".join(x)\n", ")\n", "\n", "# Tokenize the text and create a dictionary\n", "documents = df_uni[\"binary_Headline_Details\"].str.split()\n", "dictionary = corpora.Dictionary(documents)\n", "\n", "tfidf = models.TfidfModel(dictionary=dictionary, normalize=True)\n", "tfidf_corpus = [tfidf[dictionary.doc2bow(doc)] for doc in documents]\n", "term_frequencies = {dictionary[id]: freq for id, freq in tfidf.dfs.items()}" ] }, { "cell_type": "code", "execution_count": null, "id": "85b3f6ba-4a57-4f6c-9954-cfd0ac189e5a", "metadata": { "scrolled": true }, "outputs": [], "source": [ "sorted_term_frequencies = dict(\n", " sorted(term_frequencies.items(), key=lambda item: item[1], reverse=True)\n", ")\n", "sorted_term_frequencies" ] }, { "cell_type": "markdown", "id": "ff92ddb0-4b34-4833-8d88-210ebbc5acfd", "metadata": {}, "source": [ "threshold = 0.04 seems to be an appropriate cutoff with variation at +- 0.01 for this set of data." ] }, { "cell_type": "code", "execution_count": null, "id": "5f2484b7-daa2-4654-aff6-5371619182c5", "metadata": {}, "outputs": [], "source": [ "# customisable, lower threshold, more words retained.\n", "threshold = 0.4\n", "\n", "\n", "def filter_and_join(tfidf_doc):\n", " filtered_terms = [dictionary[id] for id, score in tfidf_doc if score >= threshold]\n", " return filtered_terms\n", "\n", "\n", "df_uni[\"binary_Headline_Details\"] = [filter_and_join(doc) for doc in tfidf_corpus]" ] }, { "cell_type": "code", "execution_count": null, "id": "58a8778c-1005-488c-a54a-6c6c6b02f05f", "metadata": {}, "outputs": [], "source": [ "fdist_doc = nltk.FreqDist(df_uni[\"binary_Headline_Details\"][0]).most_common(25)\n", "\n", "x, y = zip(*fdist_doc)\n", "plt.figure(figsize=(50, 30))\n", "plt.margins(0.02)\n", "plt.bar(x, y)\n", "plt.xlabel(\"Words\", fontsize=50)\n", "plt.ylabel(\"Frequency of Words\", fontsize=50)\n", "plt.yticks(fontsize=40)\n", "plt.xticks(rotation=60, fontsize=40)\n", "plt.title(\"Frequency of 25 Most Common Words for One Random News\", fontsize=60)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "0bdfcd90-a066-4864-8778-9f9d0fb715ce", "metadata": {}, "outputs": [], "source": [ "all_words_filtered = [\n", " word for sublist in df_uni[\"binary_Headline_Details\"] for word in sublist\n", "]\n", "all_words_filtered[:2]\n", "# Calculate word frequencies\n", "fdist_filtered = FreqDist(all_words_filtered)" ] }, { "cell_type": "code", "execution_count": null, "id": "5c2e8cc6-db44-4f0d-9c33-841bae1f5094", "metadata": {}, "outputs": [], "source": [ "# Plot the word frequency distribution as a bar graph\n", "# apparently, the dataset is much cleaner now.\n", "plt.figure(figsize=(12, 6))\n", "plt.title(\"Frequency of 25 Most Common Words of the Dataset\", fontsize=12)\n", "fdist_filtered.plot(30, cumulative=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "b3e99195-1c0a-4c7d-91d7-9a84e8ab1422", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from wordcloud import WordCloud\n", "\n", "# Plotting with Seaborn for each company\n", "for region in com[:10]:\n", " haha = df_uni[\"binary_Headline_Details\"].loc[df_uni.Severity == region]\n", " text = \" \".join(\" \".join(item) for item in haha)\n", " wordcloud = WordCloud(background_color=\"white\").generate(text)\n", 
" plt.imshow(wordcloud, interpolation=\"bilinear\")\n", " plt.title(f\"Wordcloud for {company}\")\n", " plt.axis(\"off\")\n", " plt.margins(x=0, y=0)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "91c359e9-105a-4bcd-bef7-fbc97538667a", "metadata": {}, "outputs": [], "source": [ "df_uni[\"word_count\"] = df_uni[\"binary_Headline_Details\"].apply(len)" ] }, { "cell_type": "code", "execution_count": null, "id": "4fb07549-2792-4158-8f5d-861b2d0ea487", "metadata": {}, "outputs": [], "source": [ "df_uni[[\"word_count\"]].describe().round()" ] }, { "cell_type": "code", "execution_count": null, "id": "5a46305b-47fb-486e-9a4b-93653c555df9", "metadata": {}, "outputs": [], "source": [ "# count of news by sector\n", "df_uni[[\"binary_Headline_Details\", \"Region\"]].groupby(\"Region\").count().sort_values(\n", " by=\"binary_Headline_Details\", ascending=False\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "02f79cd8-dd17-4ed8-b077-39dd5cc730d3", "metadata": {}, "outputs": [], "source": [ "df_uni[[\"binary_Headline_Details\", \"Severity\"]].groupby(\"Severity\").count().sort_values(\n", " by=\"binary_Headline_Details\", ascending=False\n", ")" ] }, { "cell_type": "markdown", "id": "5d4d9a0c-63ab-4ef1-a196-2be7014b1476", "metadata": {}, "source": [ "# Save data to database for modelling" ] }, { "cell_type": "code", "execution_count": null, "id": "87f5e776-4d52-42c6-b6cb-a33f57a7e131", "metadata": {}, "outputs": [], "source": [ "df_uni.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "b66157f8-1f2a-48f5-99b6-a4e0db0ab57f", "metadata": {}, "outputs": [], "source": [ "df_uni.columns.to_list()" ] }, { "cell_type": "code", "execution_count": null, "id": "38cad3d6-7799-4c69-a876-782ab411395a", "metadata": {}, "outputs": [], "source": [ "# export as parquet data file instead of csv for easier list extraction\n", "df_uni.to_parquet(\"processed_data1.parquet\", index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "bad76710-d7fd-4976-a014-c850483df8fc", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" } }, "nbformat": 4, "nbformat_minor": 5 }