{ "cells": [ { "cell_type": "markdown", "id": "b89da7e6-431a-4659-a5b3-45323d11fd03", "metadata": {}, "source": [ "# Pipelines for NLP Tasks" ] }, { "cell_type": "code", "execution_count": 2, "id": "76d9d1e3-05e1-456d-b564-5d096896a778", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/milindchawre/.pyenv/versions/3.12.2/envs/hugging-face/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "import transformers\n", "from transformers import pipeline" ] }, { "cell_type": "code", "execution_count": 3, "id": "cf038c7c-13ce-4231-8acb-2a6b8de67de6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.44.0\n" ] } ], "source": [ "print(transformers.__version__)" ] }, { "cell_type": "markdown", "id": "1dbc51c2-1efd-4811-929a-54ef8424c30c", "metadata": {}, "source": [ "## Loading Tasks\n", "\n", "The `task` argument of `pipeline()` determines which pipeline is returned. 
Currently accepted tasks are:\n", " \n", " - `\"audio-classification\"`: will return an [`AudioClassificationPipeline`].\n", " - `\"automatic-speech-recognition\"`: will return an [`AutomaticSpeechRecognitionPipeline`].\n", " - `\"conversational\"`: will return a [`ConversationalPipeline`].\n", " - `\"depth-estimation\"`: will return a [`DepthEstimationPipeline`].\n", " - `\"document-question-answering\"`: will return a [`DocumentQuestionAnsweringPipeline`].\n", " - `\"feature-extraction\"`: will return a [`FeatureExtractionPipeline`].\n", " - `\"fill-mask\"`: will return a [`FillMaskPipeline`].\n", " - `\"image-classification\"`: will return an [`ImageClassificationPipeline`].\n", " - `\"image-segmentation\"`: will return an [`ImageSegmentationPipeline`].\n", " - `\"image-to-text\"`: will return an [`ImageToTextPipeline`].\n", " - `\"object-detection\"`: will return an [`ObjectDetectionPipeline`].\n", " - `\"question-answering\"`: will return a [`QuestionAnsweringPipeline`].\n", " - `\"summarization\"`: will return a [`SummarizationPipeline`].\n", " - `\"table-question-answering\"`: will return a [`TableQuestionAnsweringPipeline`].\n", " - `\"text2text-generation\"`: will return a [`Text2TextGenerationPipeline`].\n", " - `\"text-classification\"` (alias `\"sentiment-analysis\"` available): will return a [`TextClassificationPipeline`].\n", " - `\"text-generation\"`: will return a [`TextGenerationPipeline`].\n", " - `\"token-classification\"` (alias `\"ner\"` available): will return a [`TokenClassificationPipeline`].\n", " - `\"translation\"`: will return a [`TranslationPipeline`].\n", " - `\"translation_xx_to_yy\"`: will return a [`TranslationPipeline`].\n", " - `\"video-classification\"`: will return a [`VideoClassificationPipeline`].\n", " - `\"visual-question-answering\"`: will return a [`VisualQuestionAnsweringPipeline`].\n", " - `\"zero-shot-classification\"`: will return a [`ZeroShotClassificationPipeline`].\n", " - `\"zero-shot-image-classification\"`: 
will return a [`ZeroShotImageClassificationPipeline`].\n", " - `\"zero-shot-object-detection\"`: will return a [`ZeroShotObjectDetectionPipeline`]." ] }, { "cell_type": "markdown", "id": "aaa893bc-026c-456a-99f0-8f56a47da96e", "metadata": { "tags": [] }, "source": [ "## Classification \n", "\n", "### Default Models" ] }, { "cell_type": "code", "execution_count": 4, "id": "c3498239-01f0-49b1-bf3e-d9cba22d41ac", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n", "/Users/milindchawre/.pyenv/versions/3.12.2/envs/hugging-face/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. 
For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n", " warnings.warn(\n" ] }, { "data": { "text/plain": [ "[{'label': 'POSITIVE', 'score': 0.9998236298561096}]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe = pipeline(task=\"text-classification\", device=0)\n", "pipe(\"This restaurant is ok\")" ] }, { "cell_type": "markdown", "id": "e88492a4-003a-4387-9590-0f3f621278ba", "metadata": {}, "source": [ "### Specific Models\n", "\n", "You may want a model tuned to a particular domain or text type, for example, financial news: https://huggingface.co/ProsusAI/finbert\n", "\n", "You can find more details in the paper: https://arxiv.org/pdf/1908.10063" ] }, { "cell_type": "code", "execution_count": 5, "id": "bcf0966e-892f-46ca-9321-c0b31c9862ab", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n", "To disable this warning, you can either:\n", "\t- Avoid using `tokenizers` before the fork if possible\n", "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n", "Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. 
Model will be on CPU.\n" ] } ], "source": [ "pipe = pipeline(model=\"ProsusAI/finbert\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "fdf85b86-1132-4ef7-80c8-36e729be2910", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'label': 'positive', 'score': 0.9350943565368652}]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe(\"Shares of food delivery companies surged despite the catastrophic impact of coronavirus on global markets.\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "6eca0d8e-1d32-4801-945a-34e9e7bbf83d", "metadata": {}, "outputs": [], "source": [ "tweets = ['Gonna buy AAPL, its about to surge up!',\n", " 'Gotta sell AAPL, its gonna plummet!']" ] }, { "cell_type": "code", "execution_count": 8, "id": "bfc8d3e1-d7d4-457f-a61c-25c023f4851a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'label': 'positive', 'score': 0.523411750793457},\n", " {'label': 'neutral', 'score': 0.5528597831726074}]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe(tweets)" ] }, { "cell_type": "markdown", "id": "d2e8be7b-2cb7-425d-ae25-fa0b57f67f6a", "metadata": {}, "source": [ "## Named Entity Recognition\n", "\n", "Let's explore another NLP task: Named Entity Recognition (NER).\n", "\n", "**Note, this is a much larger model! 
If you run this, it will download about 1.5 GB into a cache folder on your computer!**" ] }, { "cell_type": "code", "execution_count": 9, "id": "1486daa6-6108-4179-8fd3-682c17ad8f56", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n", "Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.\n" ] } ], "source": [ "pipe = pipeline(task=\"text-classification\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "293aa275-9fde-4017-9345-ce7a4debf315", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n", "Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']\n", "- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. 
initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n", "Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.\n" ] } ], "source": [ "ner_tag_pipe = pipeline('ner')" ] }, { "cell_type": "code", "execution_count": 22, "id": "cbaaa69a-9dab-42a3-a1f5-1b109251c8c6", "metadata": {}, "outputs": [], "source": [ "result = ner_tag_pipe(\"After working at Tesla I started to study Nikola Tesla a lot more, especially at university in the USA.\")" ] }, { "cell_type": "code", "execution_count": 23, "id": "88e7d85a-6811-47a5-b400-2b36eb32e87e", "metadata": {}, "outputs": [], "source": [ "#sentence =\"\"\"After working at Tomtom I started to study AI a lot more, especially at home in the Mumbai, Topics like RAG, hugging face and data science interest me more, Eating food like snacks and packed food with my laptop is my working setup.\"\"\"\n", "#result = ner_tag_pipe(sentence)" ] }, { "cell_type": "code", "execution_count": 24, "id": "1be4c2c4-c1ae-446a-9ce3-680934e7da9c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'entity': 'I-ORG',\n", " 'score': 0.9137765,\n", " 'index': 4,\n", " 'word': 'Te',\n", " 'start': 17,\n", " 'end': 19},\n", " {'entity': 'I-ORG',\n", " 'score': 0.3789888,\n", " 'index': 5,\n", " 'word': '##sla',\n", " 'start': 19,\n", " 'end': 22},\n", " {'entity': 'I-PER',\n", " 'score': 0.99693346,\n", " 'index': 10,\n", " 'word': 'Nikola',\n", " 'start': 42,\n", " 'end': 48},\n", " {'entity': 'I-PER',\n", " 'score': 0.9901416,\n", " 'index': 11,\n", " 'word': 'Te',\n", " 'start': 49,\n", " 'end': 51},\n", " {'entity': 'I-PER',\n", " 'score': 0.8931826,\n", " 'index': 12,\n", " 'word': 
'##sla',\n", " 'start': 51,\n", " 'end': 54},\n", " {'entity': 'I-LOC',\n", " 'score': 0.9997478,\n", " 'index': 22,\n", " 'word': 'USA',\n", " 'start': 99,\n", " 'end': 102}]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result" ] }, { "cell_type": "markdown", "id": "88b446e2-8b25-4736-932b-4a7570c2570b", "metadata": {}, "source": [ "## Question Answering" ] }, { "cell_type": "code", "execution_count": 25, "id": "2ab7745a-1fe3-4e5d-89cf-8399185acd9d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n", "Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.\n" ] } ], "source": [ "qa_bot = pipeline('question-answering')" ] }, { "cell_type": "code", "execution_count": 26, "id": "4380b8e5-4807-48da-bd3e-65e78c42e967", "metadata": {}, "outputs": [], "source": [ "text = \"\"\"\n", "D-Day, marked on June 6, 1944, stands as one of the most significant military operations in history, \n", "initiating the Allied invasion of Nazi-occupied Europe during World War II. Known as Operation Overlord, \n", "this massive amphibious assault involved nearly 160,000 Allied troops landing on the beaches of Normandy, \n", "France, across five sectors: Utah, Omaha, Gold, Juno, and Sword. Supported by over 5,000 ships and 13,000 \n", "aircraft, the operation was preceded by extensive aerial and naval bombardment and an airborne assault. \n", "The invasion set the stage for the liberation of Western Europe from Nazi control, despite the heavy \n", "casualties and formidable German defenses. 
This day not only demonstrated the logistical prowess \n", "and courage of the Allied forces but also marked a turning point in the war, leading to the eventual \n", "defeat of Nazi Germany.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 38, "id": "c185e089-0966-44b6-b4d8-882faf504e3c", "metadata": {}, "outputs": [], "source": [ "question = \"What were the five beach sectors on D-Day?\"\n", "\n", "result = qa_bot(question=question,context=text)" ] }, { "cell_type": "code", "execution_count": 36, "id": "720fcd66-2e85-4034-a81a-f49901d2fbb7", "metadata": {}, "outputs": [], "source": [ "#\n", "#question = \"Who is sherlock holmes?\"\n", "#result = qa_bot(question=question,context=text)" ] }, { "cell_type": "code", "execution_count": 39, "id": "8d1918a0-70f3-4d6c-8a17-bac0d07a9761", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'score': 0.9430821537971497,\n", " 'start': 345,\n", " 'end': 379,\n", " 'answer': 'Utah, Omaha, Gold, Juno, and Sword'}" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result" ] }, { "cell_type": "markdown", "id": "f536e540-4080-4135-84e8-86fe13dc2fe6", "metadata": {}, "source": [ "## Translations\n", "\n", "Translates from one language to another.\n", "\n", "This translation pipeline can currently be loaded from pipeline() using the following task identifier: \"translation_xx_to_yy\".\n", "\n", "The models that this pipeline can use are models that have been fine-tuned on a translation task. See the up-to-date list of available models on www.huggingface.co/models. 
\n", "\n", "Note: You would typically call a specific model for translations: https://huggingface.co/models?pipeline_tag=translation" ] }, { "cell_type": "code", "execution_count": 6, "id": "24f282e5-1f0e-44ca-9b92-22b211208274", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).\n", "Using a pipeline without specifying a model name and revision in production is not recommended.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "453587fb51494963bf1c3ba521c405f4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "config.json: 0%| | 0.00/1.21k [00:00