{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Daily" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can specify the device used for storage and calculation, such as the CPU or a GPU. By default, data are created in main memory and the CPU is used for calculations.\n", "\n", "The deep learning framework requires all input data for a calculation to be on the same device, whether that is the CPU or the same GPU.\n", "\n", "You can lose significant performance by moving data without care. A typical mistake is computing the loss for every minibatch on the GPU and reporting it back to the user on the command line (or logging it in a NumPy ndarray): this triggers a global interpreter lock, which stalls all GPUs. It is much better to allocate memory for logging on the GPU and only move larger logs.\n", "\n", "- For TensorFlow 2: you can use the LSTM layer with no activation specified (i.e. the default, tanh), and it will automatically use the cuDNN implementation when running on a GPU.\n", "- Gradient clipping is a technique to prevent exploding gradients in very deep networks, usually in recurrent neural networks. ... It prevents any gradient from having a norm greater than the threshold; gradients that exceed it are clipped.\n", "\n", "### Tips for Activation Functions:\n", "- When using the ReLU function for hidden layers, it is a good practice to use a “He Normal” or “He Uniform” weight initialization and scale input data to the range 0-1 (normalize) prior to training.\n", "- When using the Sigmoid function for hidden layers, it is a good practice to use a “Xavier Normal” or “Xavier Uniform” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and scale input data to the range 0-1 (e.g. 
the range of the activation function) prior to training.\n", "- When using the TanH function for hidden layers, it is a good practice to use a “Xavier Normal” or “Xavier Uniform” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and scale input data to the range -1 to 1 (e.g. the range of the activation function) prior to training." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "import tensorflow as tf\n", "import tensorflow.keras\n", "from tensorflow.keras.utils import to_categorical, plot_model\n", "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import LSTM, Dense, GRU, Embedding, Bidirectional, TimeDistributed, BatchNormalization, Flatten\n", "from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard\n", "from tensorflow.keras.regularizers import l2\n", "from tqdm import tqdm\n", "# from keras_tqdm import TQDMNotebookCallback\n", "import tensorflow.keras.backend as K\n", "import os\n", "import time\n", "import pandas as pd\n", "import numpy as np\n", "import psutil\n", "# Ignore harmless warnings\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set_style('darkgrid')\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.3.1\n", "2.4.0\n" ] } ], "source": [ "# tf.debugging.set_log_device_placement(True)\n", "print(tf.__version__)\n", "print(tensorflow.keras.__version__)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/home/ec2-user/Models'" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pwd" ] }, { "cell_type": "code", "execution_count": 93, "metadata": 
{}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using GPU\n" ] } ], "source": [ "gpu_devices = tf.config.list_physical_devices('GPU')\n", "\n", "if gpu_devices:\n", "    print('Using GPU')\n", "    for gpu in gpu_devices[0:2]:\n", "        tf.config.experimental.set_memory_growth(gpu, True)\n", "else:\n", "    print('Using CPU')\n", "    tf.config.optimizer.set_jit(True)\n", "    print('used: {}% free: {:.2f}GB'.format(psutil.virtual_memory().percent, float(psutil.virtual_memory().free)/1024**3))" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8 Physical GPU, 2 Logical GPUs\n" ] } ], "source": [ "gpus = tf.config.list_physical_devices('GPU')\n", "if gpus:\n", "    # Restrict TensorFlow to only use some GPUs\n", "    try:\n", "        tf.config.experimental.set_visible_devices(gpus[0:2], 'GPU')\n", "        logical_gpus = tf.config.experimental.list_logical_devices('GPU')\n", "        print(len(gpus), \"Physical GPU,\", len(logical_gpus), \"Logical GPUs\")\n", "    except RuntimeError as e:\n", "        # Visible devices must be set at program startup\n", "        print(e)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[LogicalDevice(name='/device:GPU:0', device_type='GPU'),\n", " LogicalDevice(name='/device:GPU:1', device_type='GPU')]" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf.config.experimental.list_logical_devices('GPU')" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "used: 2.8% free: 351.21GB\n" ] } ], "source": [ "import psutil\n", "print('used: {}% free: {:.2f}GB'.format(psutil.virtual_memory().percent, float(psutil.virtual_memory().free)/1024**3))" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "# Prepare News headlines\n", "def clean_text(df, column):\n",
"    import re\n", "    #(\"\".join(headline)).strip()\n", "    headline = []\n", "    # Wrap each headline in <s> ... <\\s> start/end markers\n", "    for i in df[column].apply(lambda x: '<s>'+x+'<\\s>'):\n", "        headline.append(i)\n", "    return headline\n", "\n", "# Truncate each sequence to at most seq_len characters so the <\\s end marker appended afterwards is never cut off\n", "def extract_end(char_seq, seq_len):\n", "    if len(char_seq) > seq_len:\n", "        char_seq = char_seq[:seq_len] #char_seq[-seq_len:]\n", "    return char_seq\n", "\n", "# Encode characters as integers, keeping only ASCII code points (< 128)\n", "def encode2bytes(text):\n", "    #text = tf.strings.unicode_split(text, 'UTF-8').to_list()\n", "    final_list = []\n", "    for sent in text:\n", "        temp_list = []\n", "        for char in sent:\n", "            if ord(char) < 128 :\n", "                temp_list.append(ord(char))\n", "        final_list.append(temp_list)\n", "    return final_list\n", "\n", "# Build next-character pairs: X drops the last element, y drops the first\n", "def split_X_y(text):\n", "    X = []\n", "    y = []\n", "    for i in text:\n", "        X.append(i[0:-1])\n", "        y.append(i[1:])\n", "    return X,y\n", "\n", "# UTF-8 encode every string in the given sequence\n", "def to_bytes(sequence):\n", "    byte_text = []\n", "    for i in sequence:\n", "        i = i.encode('utf-8')\n", "        byte_text.append(i)\n", "    return byte_text" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "# fix random seed for reproducibility\n", "K.clear_session()\n", "tf.keras.backend.clear_session()\n", "np.random.seed(42)\n", "tf.random.set_seed(42)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "idx = pd.IndexSlice\n", "max_length = 1000" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['/model_data/15_min', '/model_data/daily', '/model_data/hourly']\n", "[1. 
0.]\n" ] } ], "source": [ "# Get data\n", "with pd.HDFStore('./model_data.h5', mode = 'r') as store:\n", "    print(store.keys())\n", "    data = store['model_data/daily']\n", "    # Truncate to max_length characters, then add the <s> start and <\\s end markers\n", "    data.headline = data.headline.apply(lambda x: extract_end(x, max_length))\n", "    data['headline'] = data.headline.apply(lambda x: '<s>' + x + '<\\s')\n", "    X = data.loc[:,'headline']\n", "    y = data.loc[:, 'label']\n", "    # Map negative labels to 0 so the target is binary\n", "    y[y<0] = 0\n", "    print(y.unique())" ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " headline \\\n", "ticker time \n", "A 2019-07-11 Delivery Of Ionization Auxiliary Equipment ... \n", " 2019-07-12 Epa Do Agilent 7250 Gc/q-tof Mass Spectrome... \n", " 2019-07-16 AGILENT TECHNOLOGIES INC : EVERCORE IS... \n", " 2019-07-17 Procurement Of Spares And Consumables For T... \n", " 2019-07-18 NYSE ORDER IMBALANCE 68900.0 SHARES O... \n", "ZTS 2020-09-28 NYSE ORDER IMBALANCE 59715.0 SHARES... \n", " 2020-09-30 Zoetis to Host Webcast and Conference Call ... \n", " 2020-10-07 Sachem Head takes $1.2 bln position in Elan... \n", " 2020-10-09 NYSE ORDER IMBALANCE 230640.0 SHARE... \n", " 2020-10-12 ZOETIS INC : CREDIT SUISSE RAISES PR... \n", "\n", " Open Close returns label \n", "ticker time \n", "A 2019-07-11 1146.160034 1144.079956 0.001101 1.0 \n", " 2019-07-12 1142.930054 1145.339966 0.004514 1.0 \n", " 2019-07-16 1146.729980 1153.459961 -0.005826 0.0 \n", " 2019-07-17 1150.920044 1146.739990 0.000436 1.0 \n", " 2019-07-18 1142.000000 1147.239990 -0.013676 0.0 \n", "ZTS 2020-09-28 162.380005 161.320007 0.007191 1.0 \n", " 2020-09-30 162.919998 165.369995 -0.008103 0.0 \n", " 2020-10-07 163.350006 159.910004 0.020386 1.0 \n", " 2020-10-09 163.979996 165.429993 0.018860 1.0 \n", " 2020-10-12 167.080002 168.550003 -0.019163 0.0 " ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head().append(data.tail())" ] }, { "cell_type": "code", "execution_count": 183, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " headline \\\n", "ticker time \n", "A 2020-10-01 AGILENT TECHNOLOGIES, INC. -- S-8<\\s \n", " 2020-10-02 The Replacement Of Sampler Handler Assembly... \n", " 2020-10-08 Agilent Measurement Suite celebrates early ... \n", " 2020-10-09 Spare Parts For Equipment Manufactured By A... \n", " 2020-10-13 NYSE ORDER IMBALANCE 50070.0 SHARES O... \n", "AAL 2020-10-01 American Airlines says it will begin furlou... \n", " 2020-10-02 Refinitiv Newscasts - Airlines push ahead w... \n", " 2020-10-05 American Airlines Adds Costa Rica to Prefli... \n", " 2020-10-06 AMERICAN AIRLINES DELAYS PLAN TO START BOEI... \n", " 2020-10-07 BUZZ-U.S. airlines rebound as Trump pushes ... \n", " 2020-10-08 BUZZ-U.S. STOCKS ON THE MOVE-Eaton Vance, I... \n", " 2020-10-09 MCCONNELL SAYS AIRLINE AID SHOULD BE PART O... \n", " 2020-10-12 Refinitiv Newscasts - Flying during the COV... \n", "AAP 2020-10-01 Hot Shot’s Secret Everyday Diesel Treatment... \n", " 2020-10-02 Advance Auto Parts Names Mann + Hummel 2020... \n", " 2020-10-06 ADVANCE AUTO PARTS INC : JP MORGAN R... \n", "AAPL 2020-10-01 Zuckerberg eyes augmented reality as the fu... \n", " 2020-10-02 Apple working on foldable iPhone with 'self... \n", " 2020-10-05 Apple, Google tracing apps limited Google, ... \n", " 2020-10-06 BUZZ-Logitech shares slip on reports of App... \n", " 2020-10-07 UPDATE 6-U.S. lawmakers detail Big Tech's m... \n", " 2020-10-08 Pomerantz Law Firm Achieves Victory on Beha... \n", " 2020-10-09 Refinitiv Newscasts - Blockchain Interviews... \n", " 2020-10-12 PRESS DIGEST- Financial Times - Oct. 12 APP... \n", " 2020-10-13 APPLE INC SAYS EVERY NEW IPHONE WILL FEATUR... \n", "ABBV 2020-10-01 Public coverage of AbbVie's SKYRIZI® for th... \n", " 2020-10-02 AbbVie to Present New Data From 15 Abstract... \n", " 2020-10-05 SHAREHOLDER ACTION REMINDER: The Schall Law... \n", " 2020-10-06 AbbVie to Host Third-Quarter 2020 Earnings ... \n", " 2020-10-07 Allergan Aesthetics, an AbbVie Company, Acq... 
\n", " 2020-10-08 Allergan, an AbbVie Company, and Von Miller... \n", " 2020-10-09 Allergan Aesthetics To Present Data From 4 ... \n", " 2020-10-12 AbbVie - Allergan Aesthetics to Present Dat... \n", " 2020-10-13 NIH SAYS TRIAL WILL TEST RISANKIZUMAB, IN C... \n", "ABC 2020-10-01 BRIEF-HHS Says Gilead Anticipates Producing... \n", " 2020-10-05 AmerisourceBergen Corporation - ION & IPN S... \n", " 2020-10-06 AmerisourceBergen Announces Date and Time f... \n", " 2020-10-12 AMERISOURCEBERGEN CORP : EVERCORE I... \n", " 2020-10-13 MWI Animal Health Selects CVP Impack Automa... \n", "ABMD 2020-10-08 TCT Connect to Highlight How Impella Enable... \n", " 2020-10-09 ABIOMED INC -- SC 13G/A ABIOMED - TCT Conne... \n", " 2020-10-12 ABIOMED INC : SVB LEERINK CUTS TAR... \n", "ABT 2020-10-01 Refinitiv Newscasts - Navigating the econom... \n", " 2020-10-02 Abbott's Libre 3 Receives CE Mark, Boosts C... \n", " 2020-10-05 NYSE ORDER IMBALANCE 227300.0 SHARE... \n", " 2020-10-06 CANADA HAS SIGNED A DEAL WITH ABBOTT RAPID ... \n", " 2020-10-07 ABBOTT LABORATORIES : WELLS FARGO R... \n", " 2020-10-08 Allergy Immunotherapies Market - Actionable... \n", " 2020-10-09 ABBOTT LABORATORIES : JP MORGAN RAI... \n", " 2020-10-12 REG - Allergy Therapeutics - Holding(s) in... 
\n", "\n", " Open Close returns label \n", "ticker time \n", "A 2020-10-01 101.769997 101.220001 -0.011954 0.0 \n", " 2020-10-02 100.209999 100.010002 0.031097 1.0 \n", " 2020-10-08 104.199997 104.160004 0.015361 1.0 \n", " 2020-10-09 104.830002 105.760002 -0.003120 0.0 \n", " 2020-10-13 105.720001 105.419998 -0.003415 0.0 \n", "AAL 2020-10-01 12.410000 12.580000 0.033386 1.0 \n", " 2020-10-02 12.050000 13.000000 0.009231 1.0 \n", " 2020-10-05 13.090000 13.120000 -0.044970 0.0 \n", " 2020-10-06 13.270000 12.530000 0.043097 1.0 \n", " 2020-10-07 12.960000 13.070000 0.006121 1.0 \n", " 2020-10-08 13.370000 13.150000 0.003802 1.0 \n", " 2020-10-09 13.250000 13.200000 -0.021212 0.0 \n", " 2020-10-12 13.140000 12.920000 -0.054180 0.0 \n", "AAP 2020-10-01 154.639999 154.850006 0.004198 1.0 \n", " 2020-10-02 152.240005 155.500000 0.002315 1.0 \n", " 2020-10-06 160.259995 155.160004 0.010634 1.0 \n", "AAPL 2020-10-01 117.629997 116.790001 -0.032280 0.0 \n", " 2020-10-02 112.889999 113.019997 0.030791 1.0 \n", " 2020-10-05 113.910004 116.500000 -0.028669 0.0 \n", " 2020-10-06 115.800003 113.160004 0.016967 1.0 \n", " 2020-10-07 114.699997 115.080002 -0.000956 0.0 \n", " 2020-10-08 116.199997 114.970001 0.017396 1.0 \n", " 2020-10-09 115.279999 116.970001 0.063521 1.0 \n", " 2020-10-12 120.019997 124.400002 -0.026527 0.0 \n", " 2020-10-13 125.320000 121.099998 0.000743 1.0 \n", "ABBV 2020-10-01 88.000000 87.139999 -0.011705 0.0 \n", " 2020-10-02 86.500000 86.120003 0.021017 1.0 \n", " 2020-10-05 86.489998 87.930000 -0.023200 0.0 \n", " 2020-10-06 87.970001 85.889999 0.013739 1.0 \n", " 2020-10-07 86.110001 87.070000 0.003101 1.0 \n", " 2020-10-08 87.279999 87.339996 0.004122 1.0 \n", " 2020-10-09 87.410004 87.699997 0.007070 1.0 \n", " 2020-10-12 88.199997 88.320000 -0.005548 0.0 \n", " 2020-10-13 87.919998 87.830002 -0.020039 0.0 \n", "ABC 2020-10-01 97.129997 95.360001 0.000210 1.0 \n", " 2020-10-05 95.769997 96.389999 -0.007885 0.0 \n", " 2020-10-06 97.199997 95.629997 
-0.000418 0.0 \n", " 2020-10-12 97.379997 96.860001 0.004233 1.0 \n", " 2020-10-13 96.199997 97.269997 0.012028 1.0 \n", "ABMD 2020-10-08 271.000000 271.200012 0.008702 1.0 \n", " 2020-10-09 273.299988 273.559998 0.015682 1.0 \n", " 2020-10-12 274.799988 277.850006 -0.003239 0.0 \n", "ABT 2020-10-01 109.180000 108.639999 -0.019698 0.0 \n", " 2020-10-02 107.680000 106.500000 0.019343 1.0 \n", " 2020-10-05 107.180000 108.559998 -0.021279 0.0 \n", " 2020-10-06 108.559998 106.250000 0.014024 1.0 \n", " 2020-10-07 107.360001 107.739998 0.007425 1.0 \n", " 2020-10-08 108.199997 108.540001 0.010227 1.0 \n", " 2020-10-09 109.339996 109.650002 0.012768 1.0 \n", " 2020-10-12 110.250000 111.050003 -0.024133 0.0 " ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[idx[:,'2020-10-01':], ].head(50)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "MultiIndex: 110905 entries, ('A', Timestamp('2019-07-11 00:00:00')) to ('ZTS', Timestamp('2020-10-12 00:00:00'))\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 headline 110905 non-null object \n", " 1 Open 110905 non-null float64\n", " 2 Close 110905 non-null float64\n", " 3 returns 110905 non-null float64\n", " 4 label 110905 non-null float64\n", "dtypes: float64(4), object(1)\n", "memory usage: 4.7+ MB\n" ] } ], "source": [ "data.info()" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count mean std min 25% \\\n", "Open 110905.0 179.949653 1195.936822 3.220000 41.509998 \n", "Close 110905.0 179.970824 1195.748259 3.020000 41.470001 \n", "returns 110905.0 0.000567 0.032472 -0.998369 -0.011305 \n", "label 110905.0 0.527046 0.499270 0.000000 0.000000 \n", "n_Characters 110905.0 297.054542 298.792556 11.000000 78.000000 \n", "\n", " 50% 75% max \n", "Open 80.650002 141.729996 29440.470703 \n", "Close 80.570000 141.729996 29423.310547 \n", "returns 0.000974 0.012553 0.429971 \n", "label 1.000000 1.000000 1.000000 \n", "n_Characters 164.000000 390.000000 1006.000000 " ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['n_Characters'] = data['headline'].str.len()\n", "data.describe().T" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((110905,), (110905,))" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape, y.shape" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [], "source": [ "# Train / Test / Infer: Test which is Validation here: 2019-07-08--->2019-08-01, Infer on \n", "X_train, X_test, X_infer = X.loc[idx[:,'2019-08-01':'2020-10-01'], ], X.loc[idx[:,:'2019-08-01'], ], X.loc[idx[:,'2020-10-01':], ]\n", "y_train, y_test, y_infer = y.loc[idx[:,'2019-08-01':'2020-10-01'], ], y.loc[idx[:,:'2019-08-01'], ], y.loc[idx[:,'2020-10-01':], ]" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((103065,), (5697,), (103065,), (5697,), (2874,), (2874,))" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape, X_test.shape, y_train.shape, y_test.shape, X_infer.shape, y_infer.shape" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "y labels:\n", " 1.0 
58452\n", "0.0 52453\n", "Name: label, dtype: int64\n", "Train Labels:\n", " 1.0 54331\n", "0.0 48734\n", "Name: label, dtype: int64\n", "Test Labels:\n", " 0.0 3009\n", "1.0 2688\n", "Name: label, dtype: int64\n", "Inference Labels:\n", " 1.0 1753\n", "0.0 1121\n", "Name: label, dtype: int64\n", "Train Ratio 0.93\n" ] } ], "source": [ "print('y labels:\\n', y.value_counts())\n", "print('Train Labels:\\n', y_train.value_counts())\n", "print('Test Labels:\\n', y_test.value_counts())\n", "print('Inference Labels:\\n', y_infer.value_counts())\n", "print(\"Train Ratio %.2f\" % (len(X_train)/ len(X)))" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [ "# ENCODE ORDINAL\n", "X_train = [i.encode('utf-8') for i in X_train]\n", "X_test = [i.encode('utf-8') for i in X_test]\n", "X_infer = [i.encode('utf-8') for i in X_infer]\n", "# X_encode = encode2bytes(X)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[b'Agilent Settles Key Intellectual Property Case in China AGILENT SETTLES KEY INTELLECTUAL PROPERTY CASE IN CHINA Agilent Settles Key Intellectual Property Case in China EChrom and Pannatek Admit to using Agilent Technology Without Permission, Agree to Pay Damages and Cease Using Agilent Technology AGILENT TECHNOLOGIES- CO REACHED AGREEMENT WITH ECHROM AND PANNATEK, AS WELL AS CERTAIN FORMER AGILENT EMPLOYEES, REGARDING AN INTELLECTUAL PROPERTY DISPUTE BRIEF-Agilent Settles Key Intellectual Property Case In China Agilent Companion Diagnostic Gains Expanded FDA Approval in Esophageal Squamous Cell Carcinoma<\\\\s',\n", " b'NYSE ORDER IMBALANCE 65800.0 SHARES ON BUY SIDE<\\\\s',\n", " b'AGILENT TECHNOLOGIES INC SEC Filings files Form -- 4<\\\\s',\n", " b'Turbo Pump Controller (model No. X3506 64002 Make Agilent NYSE ORDER IMBALANCE 51900.0 SHARES ON SELL SIDE<\\\\s',\n", " b'Supplying Of Spares With Fitting And Fixing For 70 Nos. 
Tri-cycle Paddle Van In Ward Nos.-63,64,65,66 & 67, Under A.d / Br.-vii / Swm-i. Agilent Technologies (A) Earnings Expected to Grow: Should You Buy?<\\\\s']" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train[0:5]" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "# Tokenize\n", "import json\n", "# Use a separate name for the loaded JSON so the `data` DataFrame is not overwritten\n", "with open('Tokenizer.json', encoding='utf-8') as f:\n", "    tokenizer_json = json.load(f)\n", "    tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(tokenizer_json)\n", "with open('index2char.json', encoding='utf-8') as f:\n", "    index2char = json.load(f)\n", "char2index = dict((int(v),int(k)) for k,v in index2char.items())\n", "tokenizer.word_index = char2index\n", "# with open('index2char.json', 'w', encoding='utf-8') as f:\n", "#     json.dump(index2char, f, ensure_ascii=False, indent=4)\n" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [], "source": [ "X_train = tokenizer.texts_to_sequences(X_train)\n", "X_test = tokenizer.texts_to_sequences(X_test)\n", "X_infer = tokenizer.texts_to_sequences(X_infer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to convert back to readable form:" ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0 0 0 ... 
31 63 86]\n", "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u000
0\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000Agilent Settles Key Intellectual Property Case in China AGILENT SETTLES KEY INTELLECTUAL PROPERTY CASE IN CHINA Agilent Settles Key Intellectual Property Case in China EChrom and Pannatek Admit to using Agilent Technology Without Permission, Agree to Pay Damages and Cease Using Agilent Technology AGILENT TECHNOLOGIES- CO REACHED AGREEMENT WITH ECHROM AND PANNATEK, AS WELL AS CERTAIN FORMER AGILENT EMPLOYEES, REGARDING AN INTELLECTUAL PROPERTY DISPUTE BRIEF-Agilent Settles Key Intellectual Property Case In China Agilent Companion Diagnostic Gains Expanded FDA Approval in Esophageal Squamous Cell Carcinoma<\\s\n" ] } ], "source": [ "print(X_train[0])\n", "print(bytes(list(map(index2char.get, X_train[0]))).decode('utf-8'))" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Headlines length: \n", "Mean 297.05 words (Std 298.79) max 1006\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Summarize headlines length\n", "print(\"Headlines length: \")\n", "result = [len(sentence) for sentence in X]\n", "print(\"Mean %.2f words (Std %.2f) max %d\" % (np.mean(result), np.std(result), max(result)))\n", "# plot review length\n", "plt.figure(figsize=(20,10));\n", "plt.boxplot(result)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((103065, 1116), (5697, 1023), (2874, 1032))" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# prepadded to solve for forward propagation not masking\n", "max_sentence_len = max(map(len, X))\n", "X_train = pad_sequences(X_train, maxlen = max(map(len, X_train)), padding = 'pre', truncating='pre')\n", "X_test = pad_sequences(X_test, maxlen = max(map(len, X_test)), padding = 'pre', truncating='pre')\n", "X_infer = pad_sequences(X_infer, maxlen = max(map(len, X_infer)), padding = 'pre', truncating='pre')\n", "X_train.shape, X_test.shape, X_infer.shape\n", "# X_padded = pad_sequences(X_encode, maxlen = max_sentence_len, padding = 'pre', truncating='pre')" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((103065, 1), (5697, 1), (2874, 1))" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Our vectorized labels\n", "train_labels = np.asarray(y_train).astype('float32').reshape((-1,1))\n", "test_labels = np.asarray(y_test).astype('float32').reshape((-1,1))\n", "infer_labels = np.asarray(y_infer).astype('float32').reshape((-1,1))\n", "train_labels.shape, test_labels.shape, infer_labels.shape" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "train_seq_data = tf.data.Dataset.from_tensor_slices((X_train,train_labels))\n", "test_seq_data = 
tf.data.Dataset.from_tensor_slices((X_test,test_labels))\n", "infer_seq_data = tf.data.Dataset.from_tensor_slices((X_infer,infer_labels))" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [], "source": [ "AUTOTUNE = tf.data.experimental.AUTOTUNE\n", "\n", "def configure_dataset(dataset):\n", " return dataset.cache().prefetch(buffer_size=AUTOTUNE)" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train Set Shape: \n", "Test Set Shape: \n" ] } ], "source": [ "batch_size = 64\n", "\n", "train_seq_data = train_seq_data.batch(batch_size, drop_remainder=True)\n", "test_seq_data = test_seq_data.batch(batch_size, drop_remainder=True)\n", "infer_seq_data = infer_seq_data.batch(64, drop_remainder=True)\n", "print('Train Set Shape: ', train_seq_data, '\\nTest Set Shape: ', test_seq_data)" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [], "source": [ "# with tf.device('CPU'):\n", "train_seq_data = configure_dataset(train_seq_data)\n", "test_seq_data = configure_dataset(test_seq_data)\n", "infer_seq_data = configure_dataset(infer_seq_data)" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(,\n", " ,\n", " )" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_seq_data, test_seq_data, infer_seq_data" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"CharLSTM\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "EmbedLayer (Embedding) (128, None, 256) 47360 \n", "_________________________________________________________________\n", "BiLSTM (Bidirectional) (128, None, 2048) 10493952 \n", 
"_________________________________________________________________\n", "time_distributed (TimeDistri (None, None, 185) 379065 \n", "=================================================================\n", "Total params: 10,920,377\n", "Trainable params: 10,920,377\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "# def create_language_model(batch_size):\n", "# model = Sequential(name = 'CharLSTM')\n", "# model.add(Embedding(127, 256,batch_input_shape=[batch_size, None], \n", "# mask_zero=True, name ='EmbedLayer'))\n", "# model.add(Bidirectional(LSTM(1024, return_sequences=True,stateful=False,\n", "# recurrent_initializer='glorot_uniform'), merge_mode ='ave',name = 'BiLSTM'))\n", "# model.add(TimeDistributed(Dense(127, name = 'TimeDistDense')))\n", "# model.compile(optimizer=tf.optimizers.SGD(learning_rate=1e-1), \n", "# loss = tf.losses.SparseCategoricalCrossentropy(from_logits = True))\n", "# return model\n", "# #Compile then load weights\n", "# checkpoint_dir = './training_checkpoints_CharWeights'\n", "\n", "# ChaRmodel = create_language_model(batch_size=None)\n", "\n", "# print(tf.train.latest_checkpoint(checkpoint_dir))\n", "# ChaRmodel.load_weights(tf.train.latest_checkpoint(checkpoint_dir))\n", "\n", "# ChaRmodel.build(tf.TensorShape([1,None]))\n", "\n", "# # Get layers to intialize classification model\n", "# embeddings = ChaRmodel.layers[0].get_weights()[0]\n", "# lstm = ChaRmodel.layers[1].get_weights()[0]\n", "# print(embeddings.shape, lstm.shape)\n", "\n", "ChaRmodel = tf.keras.models.load_model('CharLM.h5')\n", "ChaRmodel.summary()" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "185" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(char2index)" ] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, 
"outputs": [], "source": [ "def direction_model():\n", " model = Sequential(name = 'RNNStocks')\n", " model.add(Embedding(input_dim = 185, output_dim = 256,batch_input_shape=[None, None],\n", " mask_zero = True, name ='EmbedLayer'))\n", " model.add(Bidirectional(LSTM(1024,\n", " return_sequences=False,stateful=False, \n", " recurrent_initializer='glorot_uniform'), merge_mode ='concat',name = 'BiLSTM'))\n", " #final state encodes full representation of a single passed headine\n", " model.add(BatchNormalization(name='BatchNormal')) #After RNN(S-shape activation-f(x) / Before ReLU(Non-Gaussian))\n", "# model.add(tf.keras.layers.Masking(mask_value=0))\n", " model.add(Dense(512, name = 'FullConnected', kernel_initializer='he_normal')) \n", " model.add(tf.keras.layers.LeakyReLU()) #controls vanishing gradients:f(x) = a * (exp(x) - 1.) for x < 0 ; f(x) = x for x >= 0\n", " model.add(BatchNormalization(name='BatchNormal2'))\n", " model.add(Dense(1, activation='sigmoid',name='Output'))\n", " model.compile(optimizer=tf.optimizers.Adadelta(learning_rate = 1e-04), loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", " metrics=['accuracy', tf.keras.metrics.AUC(name='AUC')])\n", " return model" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [], "source": [ "# previous_training = tf.keras.models.load_model('daily.h5').get_weights()" ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [], "source": [ " # batch size of 128 headlines is used to space out weight updates for our large data\n", "epochs = 100\n", "# # with strategy.scope():\n", "# gpus = tf.config.experimental.list_logical_devices('GPU')\n", "# if gpus:\n", "# # Replicate your computation on multiple GPUs\n", "# for gpu in gpus:\n", "with tf.device('GPU:1'):\n", " model = direction_model()\n", " checkpoint_dir = './training_checkpoints_CharWeights_V2'\n", " # print(tf.train.latest_checkpoint(checkpoint_dir))\n", " # 
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))\n", " model.layers[0].set_weights(ChaRmodel.layers[0].get_weights())\n", " model.layers[1].set_weights(ChaRmodel.layers[1].get_weights())" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [], "source": [ "#Freeze Language Model Layers for initial training of Added Layers:\n", "# Freeze all the layers before the `fine_tune_at` layer\n", "for layer in model.layers[:2]:\n", " layer.trainable = False" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"RNNStocks\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "EmbedLayer (Embedding) (None, None, 256) 47360 \n", "_________________________________________________________________\n", "BiLSTM (Bidirectional) (None, 2048) 10493952 \n", "_________________________________________________________________\n", "BatchNormal (BatchNormalizat (None, 2048) 8192 \n", "_________________________________________________________________\n", "FullConnected (Dense) (None, 512) 1049088 \n", "_________________________________________________________________\n", "leaky_re_lu (LeakyReLU) (None, 512) 0 \n", "_________________________________________________________________\n", "BatchNormal2 (BatchNormaliza (None, 512) 2048 \n", "_________________________________________________________________\n", "Output (Dense) (None, 1) 513 \n", "=================================================================\n", "Total params: 11,601,153\n", "Trainable params: 1,054,721\n", "Non-trainable params: 10,546,432\n", "_________________________________________________________________\n" ] } ], "source": [ "model.summary()" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "True\n", "True\n", "True\n", "True\n", "True\n", "True\n", "True\n" ] } ], "source": [ "print(np.all(model.get_layer('EmbedLayer').get_weights()[0] == ChaRmodel.get_layer('EmbedLayer').get_weights()[0] ))\n", "for i in range(0, len(model.get_layer('BiLSTM').get_weights())):\n", " print(np.all(model.get_layer('BiLSTM').get_weights()[i] == ChaRmodel.get_layer('BiLSTM').get_weights()[i]))" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[True, True, True, True, True, True, True]\n", "[True, True, True]\n" ] } ], "source": [ "print([layer.supports_masking for layer in model.layers])\n", "print([layer.supports_masking for layer in ChaRmodel.layers])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test Model Pre training it" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSFT rallying after last earnings call<\\s\n", "[[31], [86], [33], [48], [54], [41], [55], [3], [85], [68], [79], [79], [92], [76], [81], [74], [3], [68], [73], [87], [72], [85], [3], [79], [68], [86], [87], [3], [72], [68], [85], [81], [76], [81], [74], [86], [3], [70], [68], [79], [79], [31], [63], [86]]\n" ] }, { "data": { "text/plain": [ "(1, 44)" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample = 'MSFT rallying after last earnings call' \n", "sample = '' + sample + '<\\s' \n", "print(sample)\n", "sample = [i.encode('utf-8') for i in sample]\n", "sample = tokenizer.texts_to_sequences(sample)\n", "print(sample)\n", "sample = tf.squeeze(sample)\n", "sample = tf.expand_dims(sample, 0).numpy()\n", "sample.shape" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSFT rallying after last earnings 
call<\\s\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\n", "[[31], [86], [33], [48], [54], [41], [55], [3], [85], [68], [79], [79], [92], [76], [81], [74], [3], [68], [73], [87], [72], [85], [3], [79], [68], [86], [87], [3], [72], [68], [85], [81], [76], [81], [74], [86], [3], [70], [68], [79], [79], [31], [63], [86], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0]]\n" ] }, { "data": { "text/plain": [ "(1, 64)" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "padded_sample = 'MSFT rallying after last earnings call' \n", "padded_sample = '<s>' + padded_sample + '<\\s' + chr(0) * 20\n", "print(padded_sample)\n", "padded_sample = [i.encode('utf-8') for i in padded_sample]\n", "padded_sample = tokenizer.texts_to_sequences(padded_sample)\n", "print(padded_sample)\n", "padded_sample = tf.squeeze(padded_sample)\n", "padded_sample = tf.expand_dims(padded_sample, 0).numpy()\n", "padded_sample.shape" ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.3163721]]\n", "vs\n", "[[0.3163721]]\n" ] } ], "source": [ "print(model(sample).numpy())\n", "print('vs')\n", "print(model(padded_sample).numpy())" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6294831119673949" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1/(1+np.exp(-0.53))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute Pre-Training Baseline Metrics" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "89/89 [==============================] - 50s 560ms/step - loss: 0.7115 - accuracy: 0.5283 - AUC: 0.5113\n" ] } ], "source": [ "# TEST DATA\n", 
"initial_epochs = 10\n", "validation_steps= int(X_test.shape[0] / batch_size)\n", "\n", "initial_loss, initial_accuracy, intial_AUC = model.evaluate(test_seq_data, steps = validation_steps)" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initial loss: 0.71 | initial Accuracy : 52.83% | Initial AUC: 51.13%\n" ] } ], "source": [ "print(f'Initial loss: {initial_loss:.2f} | initial Accuracy : {initial_accuracy:.2%} | Initial AUC: {intial_AUC:.2%}')" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44/44 [==============================] - 24s 556ms/step - loss: 0.6721 - accuracy: 0.3899 - AUC: 0.5058\n" ] } ], "source": [ "# VALIDATION DATA\n", "validation_step_2 = int(X_infer.shape[0] / batch_size)\n", "\n", "initial_loss, initial_accuracy, intial_AUC = model.evaluate(infer_seq_data, steps = validation_step_2)" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initial loss: 0.67 | initial Accuracy : 38.99% | Initial AUC: 50.58%\n" ] } ], "source": [ "print(f'Initial loss: {initial_loss:.2f} | initial Accuracy : {initial_accuracy:.2%} | Initial AUC: {intial_AUC:.2%}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Warm Start Training\n", "* Pretrained weights should be freezed, for the intial tuning of new added final layers to avoid large gradient updates that can eliminate the pretrained results from the language model." 
] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [], "source": [ "early_stopping = EarlyStopping(monitor='val_accuracy',\n", " patience=5,\n", " mode='max',\n", " restore_best_weights=True)\n", "csv_logs = tf.keras.callbacks.CSVLogger('./daily_log.csv', separator=\",\", append=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/30\n", "1610/1610 [==============================] - 4016s 2s/step - loss: 0.7166 - accuracy: 0.5018 - AUC: 0.5018 - val_loss: 0.7476 - val_accuracy: 0.5065 - val_AUC: 0.5058\n", "Epoch 2/30\n", "1610/1610 [==============================] - 3998s 2s/step - loss: 0.7149 - accuracy: 0.5005 - AUC: 0.5017 - val_loss: 0.7449 - val_accuracy: 0.5056 - val_AUC: 0.5013\n", "Epoch 3/30\n", "1610/1610 [==============================] - 4019s 2s/step - loss: 0.7138 - accuracy: 0.5011 - AUC: 0.5027 - val_loss: 0.7428 - val_accuracy: 0.4989 - val_AUC: 0.5009\n", "Epoch 4/30\n", "1610/1610 [==============================] - 3975s 2s/step - loss: 0.7132 - accuracy: 0.5004 - AUC: 0.5032 - val_loss: 0.7414 - val_accuracy: 0.5037 - val_AUC: 0.5032\n", "Epoch 5/30\n", " 530/1610 [========>.....................] 
- ETA: 43:40 - loss: 0.7106 - accuracy: 0.5076 - AUC: 0.5121" ] } ], "source": [ "history = model.fit(train_seq_data,\n", " epochs=30,\n", " validation_data = test_seq_data,\n", " callbacks=[early_stopping, csv_logs])" ] }, { "cell_type": "code", "execution_count": 186, "metadata": {}, "outputs": [], "source": [ "def plot_learning_curves(df):\n", " fig, axes = plt.subplots(ncols=2, figsize=(15, 4))\n", " df[['loss', 'val_loss']].plot(ax=axes[0], title='Cross-Entropy')\n", " df[['accuracy', 'val_accuracy']].plot(ax=axes[1], title='Accuracy')\n", " for ax in axes:\n", " ax.legend(['Training', 'Validation'])\n", " sns.despine() \n", " fig.tight_layout();" ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [], "source": [ "metrics = pd.DataFrame(history.history)" ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_learning_curves(metrics)" ] }, { "cell_type": "code", "execution_count": 189, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "89/89 [==============================] - 51s 568ms/step - loss: 0.7476 - accuracy: 0.5065 - AUC: 0.5058\n" ] } ], "source": [ "second_loss, second_accuracy, second_AUC = model.evaluate(test_seq_data, steps = validation_steps)" ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Transfer loss: 0.75 | Transfer Accuracy : 50.65% | Transfer AUC: 50.58%\n" ] } ], "source": [ "print(f'Transfer loss: {second_loss:.2f} | Transfer Accuracy : {second_accuracy:.2%} | Transfer AUC: {second_AUC:.2%}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fine Tune all Layers for Target Task" ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [], "source": [ "model.trainable = True" ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"RNNStocks\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "EmbedLayer (Embedding) (None, None, 256) 47360 \n", "_________________________________________________________________\n", "BiLSTM (Bidirectional) (None, 2048) 10493952 \n", "_________________________________________________________________\n", "BatchNormal (BatchNormalizat (None, 2048) 8192 \n", "_________________________________________________________________\n", "FullConnected (Dense) (None, 512) 1049088 \n", "_________________________________________________________________\n", "leaky_re_lu (LeakyReLU) (None, 512) 0 \n", 
"_________________________________________________________________\n", "BatchNormal2 (BatchNormaliza (None, 512) 2048 \n", "_________________________________________________________________\n", "Output (Dense) (None, 1) 513 \n", "=================================================================\n", "Total params: 11,601,153\n", "Trainable params: 11,596,033\n", "Non-trainable params: 5,120\n", "_________________________________________________________________\n" ] } ], "source": [ "model.summary()" ] }, { "cell_type": "code", "execution_count": 197, "metadata": {}, "outputs": [], "source": [ "# Name of the checkpoint files and save each weights at each epoch\n", "# checkpoint_dir = './training_Daily'\n", "# checkpoint_prefix = os.path.join(checkpoint_dir, \"daily.h5\")\n", "checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(\n", " filepath='daily.h5',\n", " verbose=1,\n", " monitor='val_accuracy',\n", " mode='max',\n", " save_best_only=True)\n", "\n", "early_stopping = EarlyStopping(monitor='val_AUC',\n", " patience = 20,\n", " mode='max',\n", " restore_best_weights=True)\n", "csv_logs = tf.keras.callbacks.CSVLogger('./daily_log.csv', separator=\",\", append=True)" ] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [], "source": [ "base_learning_rate = 0.0001\n", "model.compile(optimizer =tf.keras.optimizers.Adam(lr=base_learning_rate / 10), loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", " metrics=['accuracy', tf.keras.metrics.AUC(name='AUC')])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 7/94\n", "1610/1610 [==============================] - ETA: 0s - loss: 0.7074 - accuracy: 0.4907 - AUC: 0.5021\n", "Epoch 00007: val_accuracy improved from -inf to 0.52844, saving model to daily.h5\n", "1610/1610 [==============================] - 3972s 2s/step - loss: 0.7074 - accuracy: 0.4907 - AUC: 0.5021 - val_loss: 0.6993 - 
val_accuracy: 0.5284 - val_AUC: 0.5046\n", "Epoch 8/94\n", "1610/1610 [==============================] - ETA: 0s - loss: 0.7042 - accuracy: 0.4879 - AUC: 0.5028\n", "Epoch 00008: val_accuracy did not improve from 0.52844\n", "1610/1610 [==============================] - 3991s 2s/step - loss: 0.7042 - accuracy: 0.4879 - AUC: 0.5028 - val_loss: 0.7096 - val_accuracy: 0.5219 - val_AUC: 0.5014\n", "Epoch 9/94\n", "1610/1610 [==============================] - ETA: 0s - loss: 0.6987 - accuracy: 0.4828 - AUC: 0.5115\n", "Epoch 00009: val_accuracy did not improve from 0.52844\n", "1610/1610 [==============================] - 3969s 2s/step - loss: 0.6987 - accuracy: 0.4828 - AUC: 0.5115 - val_loss: 0.7056 - val_accuracy: 0.5246 - val_AUC: 0.4984\n", "Epoch 10/94\n", "1610/1610 [==============================] - ETA: 0s - loss: 0.6947 - accuracy: 0.4804 - AUC: 0.5213\n", "Epoch 00010: val_accuracy did not improve from 0.52844\n", "1610/1610 [==============================] - 4006s 2s/step - loss: 0.6947 - accuracy: 0.4804 - AUC: 0.5213 - val_loss: 0.7006 - val_accuracy: 0.5281 - val_AUC: 0.4954\n", "Epoch 11/94\n", "1610/1610 [==============================] - ETA: 0s - loss: 0.6925 - accuracy: 0.4790 - AUC: 0.5279\n", "Epoch 00011: val_accuracy did not improve from 0.52844\n", "1610/1610 [==============================] - 3941s 2s/step - loss: 0.6925 - accuracy: 0.4790 - AUC: 0.5279 - val_loss: 0.6978 - val_accuracy: 0.5283 - val_AUC: 0.4954\n", "Epoch 12/94\n", "1610/1610 [==============================] - ETA: 0s - loss: 0.6912 - accuracy: 0.4786 - AUC: 0.5326\n", "Epoch 00012: val_accuracy did not improve from 0.52844\n", "1610/1610 [==============================] - 3968s 2s/step - loss: 0.6912 - accuracy: 0.4786 - AUC: 0.5326 - val_loss: 0.6960 - val_accuracy: 0.5283 - val_AUC: 0.4930\n", "Epoch 13/94\n", "1610/1610 [==============================] - ETA: 0s - loss: 0.6903 - accuracy: 0.4785 - AUC: 0.5368\n", "Epoch 00013: val_accuracy did not improve from 0.52844\n", 
"1610/1610 [==============================] - 3994s 2s/step - loss: 0.6903 - accuracy: 0.4785 - AUC: 0.5368 - val_loss: 0.6998 - val_accuracy: 0.5283 - val_AUC: 0.4959\n", "Epoch 14/94\n", "1610/1610 [==============================] - ETA: 0s - loss: 0.6896 - accuracy: 0.4791 - AUC: 0.5404\n", "Epoch 00014: val_accuracy did not improve from 0.52844\n", "1610/1610 [==============================] - 4013s 2s/step - loss: 0.6896 - accuracy: 0.4791 - AUC: 0.5404 - val_loss: 0.7024 - val_accuracy: 0.5283 - val_AUC: 0.4965\n", "Epoch 15/94\n", "1610/1610 [==============================] - ETA: 0s - loss: 0.6889 - accuracy: 0.4805 - AUC: 0.5438\n", "Epoch 00015: val_accuracy improved from 0.52844 to 0.53020, saving model to daily.h5\n", "1610/1610 [==============================] - 3971s 2s/step - loss: 0.6889 - accuracy: 0.4805 - AUC: 0.5438 - val_loss: 0.7009 - val_accuracy: 0.5302 - val_AUC: 0.5010\n", "Epoch 16/94\n", "1036/1610 [==================>...........] - ETA: 23:07 - loss: 0.6882 - accuracy: 0.4822 - AUC: 0.5472" ] } ], "source": [ "epochs = 100 - (history.epoch[-1] +1) \n", "start = time.time()\n", "fine_tuned_history = model.fit(train_seq_data, epochs=epochs, initial_epoch=(history.epoch[-1] + 1),\n", " verbose = 1, validation_data=(test_seq_data),\n", " callbacks=[checkpoint_callback, early_stopping, csv_logs])\n", "end = time.time()\n", "print(\"Time took {:3.1f} min\".format((end-start)/60))" ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "89/89 [==============================] - 50s 564ms/step - loss: 0.6948 - accuracy: 0.5318 - AUC: 0.5123\n", "Test loss: 0.6948203444480896\n", "Test accuracy: 0.5317766666412354\n", "Test AUC: 0.5122969746589661\n" ] } ], "source": [ "score = model.evaluate((test_seq_data), verbose=1)\n", "print('Test loss:', score[0])\n", "print('Test accuracy:', score[1])\n", "print('Test AUC:', score[2])" ] }, { "cell_type": "code", 
"execution_count": 201, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44/44 [==============================] - 25s 569ms/step - loss: 0.6873 - accuracy: 0.3913 - AUC: 0.5152\n", "Test loss: 0.6873216032981873\n", "Test accuracy: 0.39133521914482117\n", "Test AUC: 0.5152168273925781\n" ] } ], "source": [ "score = model.evaluate((infer_seq_data), verbose=1)\n", "print('Test loss:', score[0])\n", "print('Test accuracy:', score[1])\n", "print('Test AUC:', score[2])" ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [], "source": [ "fine_tuned = pd.DataFrame(fine_tuned_history.history)\n", "df = metrics.append(fine_tuned).reset_index(drop = 1)\n", "df.index +=1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(ncols=3, figsize=(16, 4))\n", "#ACCURACY\n", "df1 = (df[['accuracy', 'val_accuracy']]\n", " .rename(columns={'accuracy': 'Training',\n", " 'val_accuracy': 'Validation'}))\n", "df1.plot(ax=axes[0], title='Accuracy', xlim=(1, len(df)))\n", "axes[0].axvline(df.val_accuracy.idxmax(), ls='--', lw=1, c='k')\n", "axes[0].axvline(len(metrics), ls='-', lw=1, c='k')\n", "#AUC\n", "df2 = (df[['AUC', 'val_AUC']]\n", " .rename(columns={'AUC': 'Training',\n", " 'val_AUC': 'Validation'}))\n", "df2.plot(ax=axes[1], title='Area under the ROC Curve', xlim=(1, len(df)))\n", "\n", "axes[1].axvline(df.val_AUC.idxmax(), ls='--', lw=1, c='k')\n", "axes[1].axvline(len(metrics), ls='-', lw=1, c='k')\n", "#LOSS\n", "df2 = (df[['loss', 'val_loss']]\n", " .rename(columns={'loss': 'Training',\n", " 'val_loss': 'Validation'}))\n", "df2.plot(ax=axes[2], title='Loss', xlim=(1, len(df)))\n", "\n", "axes[2].axvline(df.val_loss.idxmin(), ls='--', lw=1, c='k')\n", "axes[2].axvline(len(metrics), ls='-', lw=1, c='k')\n", "for i in [0, 1, 2]:\n", " axes[i].set_xlabel('Epoch')\n", "\n", "sns.despine()\n", "fig.tight_layout()" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# for ax in axes:\n", "# ax.axvline(10, ls='--', lw=1, c='k')\n", "# ax.legend(['Training', 'Validation', 'Start Fine Tuning'])\n", "# ax.set_xlabel('Epoch')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Refit on other seq length" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 4 }