{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "98b2161f-60f5-4d38-98d7-493761c23c72",
   "metadata": {},
   "source": [
    "## Computational - 35 points\n",
    "\n",
    "These problems have a coding solution. Code your solutions in Python in the space provided.\n",
    "\n",
    "Use of LLM's such as Chat GPT are not allowed for this computational task, please try not to use such applications as we do check for the same during grading."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "765eea4a-212a-4d10-84c6-0180108de2ca",
   "metadata": {},
   "source": [
    "### Problem 1 - Matrices [18 points]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a4f56c2-2379-4dee-b064-a8254d4fd8a7",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package treebank to\n",
      "[nltk_data]     /Users/Abhishek/nltk_data...\n",
      "[nltk_data]   Package treebank is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "# Import the following libraries, numpy, nltk, treebank from nltk corpus, pandas - 3 Points\n",
    "\n",
    "\n",
    "# Your code here\n",
    "nltk.download('treebank')\n",
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72f62eab-cb53-4b7f-a78f-2c97b8c85493",
   "metadata": {},
   "source": [
    "### Compute Transition Matrix - 5 Points\n",
    "\n",
    "#### This function calculates the transition matrix for a Hidden Markov Model used in POS tagging.\n",
    "- The matrix represents the probabilities of transitioning from one POS tag to another.\n",
    "- Students should fill in the blanks with appropriate code snippets based on the hints provided.\n",
    "- It's important to understand the role of normalization and smoothing in the context of probability matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7769e5a3-199e-4d16-a423-bba5c7b7cf24",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to compute transition matrix\n",
    "\n",
    "def compute_transition_matrix(tagged_sentences, tag_to_index, smoothing=0.001):\n",
    "    # Question 1: Determine the number of unique POS tags. Fill in the blank with the appropriate code.\n",
    "    # Hint: The number of unique POS tags is the length of the 'tag_to_index' dictionary.\n",
    "    tag_count = ___  # Fill in the blank\n",
    "\n",
    "    # Initialize the transition matrix with zeros\n",
    "    # Question 2: Create a square matrix with dimensions equal to the number of tags, initialized to zeros.\n",
    "    # Hint: Use NumPy's zeros function with the appropriate size.\n",
    "    transition_matrix = np.zeros((___, ___))  # Fill in the blanks\n",
    "\n",
    "    # Counting transitions between tags\n",
    "    for sentence in tagged_sentences:\n",
    "        for i in range(len(sentence) - 1):\n",
    "            current_tag, next_tag = sentence[i][1], sentence[i + 1][1]\n",
    "            # Question 3: Increment the count for the transition from 'current_tag' to 'next_tag'.\n",
    "            # Hint: Use the 'tag_to_index' dictionary to find the indices for the tags.\n",
    "            transition_matrix[tag_to_index[___]][tag_to_index[___]] += 1  # Fill in the blanks\n",
    "\n",
    "    # Normalize to get probabilities\n",
    "    # Question 4: Normalize each column in the transition matrix so that it sums to 1.\n",
    "    # Hint: Divide each value in a column by the sum of its column. Use a column Stochaistic approach\n",
    "    for i in range(tag_count):\n",
    "        transition_matrix____ /= transition_matrix___.sum()\n",
    "\n",
    "    # Add smoothing\n",
    "    # Question 5: Apply smoothing to the transition matrix and re-normalize.\n",
    "    # Hint: Add the 'smoothing' value to each element and then normalize again.\n",
    "    ___ += ___  # Fill in the blank\n",
    "    for i in range(tag_count):\n",
    "        transition_matrix[:, i] /= transition_matrix[:, i].sum()\n",
    "\n",
    "    return transition_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b94fad7-82fd-49bd-a677-8591f6689e4c",
   "metadata": {},
   "source": [
    "### Compute Emission Matrix - 5 Points\n",
    "\n",
    "#### This function calculates the emission matrix for a Hidden Markov Model used in POS tagging.\n",
    "- The matrix represents the probabilities of each word being generated from a given POS tag.\n",
    "- Students should fill in the blanks based on the context of the questions and the hints provided.\n",
    "- Understanding the role of the emission matrix in HMM, the concept of smoothing, and the significance of normalization in probability matrices is crucial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10c7c765-111a-4092-80f6-2e478ac961f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to compute emission matrix\n",
    "\n",
    "def compute_emission_matrix(tagged_sentences, tag_to_index, smoothing=0.001):\n",
    "    # Question 1: Create a set of all unique words in the tagged_sentences. \n",
    "    # Hint: Use a set comprehension to extract words from each sentence.\n",
    "    words = set([___ for sentence in tagged_sentences for word, _ in sentence])  # Fill in the blank\n",
    "\n",
    "    # Question 2: Create a dictionary mapping each word to a unique index.\n",
    "    # Hint: Enumerate over the 'words' set to create this mapping.\n",
    "    word_to_index = {word: index for index, word in _____(_____)}\n",
    "\n",
    "    # Question 3: Initialize the emission matrix with zeros.\n",
    "    # Hint: The matrix dimensions should be the number of tags by the number of words.\n",
    "    emission_matrix = np._____((len(tag_to_index), len(_______)))  # Fill in the blank\n",
    "\n",
    "    # Count emissions\n",
    "    for sentence in tagged_sentences:\n",
    "        for word, tag in sentence:\n",
    "            # Question 4: Increment the count in the emission matrix for the observed word-tag pair.\n",
    "            # Hint: Use 'tag_to_index' and 'word_to_index' to find the correct indices.\n",
    "            emission_matrix[tag_to_index[___]][word_to_index[___]] += 1  # Fill in the blanks\n",
    "\n",
    "    # Normalize to get probabilities\n",
    "    for i in range(len(tag_to_index)):\n",
    "        emission_matrix[:, i] /= emission_matrix[:, i].sum()\n",
    "    \n",
    "    # Add smoothing for unknown words\n",
    "    emission_matrix += smoothing\n",
    "    for i in range(len(tag_to_index)):\n",
    "        # Question 5: Re-normalize the matrix after adding smoothing.\n",
    "        # Hint: Ensure that each column of the matrix sums to 1 after smoothing.\n",
    "        emission_matrix_____ /= emission_matrix____.sum()\n",
    "\n",
    "    return emission_matrix, word_to_index\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "390d17f2-9387-4533-abe3-11452b9e33ac",
   "metadata": {},
   "source": [
    "### Compute Initial Probabilities - 5 Points\n",
    "\n",
    "#### This function calculates the initial probabilities for a Hidden Markov Model used in POS tagging.\n",
    "- The array initial_probabilities represents the probability of each POS tag being the start tag in a sentence.\n",
    "- Students should fill in the blanks based on the context of the questions and the hints provided.\n",
    "- Understanding the concept of initial probabilities in HMM and the importance of normalization in probability distributions is crucial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d07aa018-5f83-4c6c-b0b6-91c787094984",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to compute initial probabilities\n",
    "\n",
    "def compute_initial_probabilities(tagged_sentences, tag_to_index):\n",
    "    # Initialize an array for storing initial probabilities\n",
    "    # Question 1: Initialize the initial_probabilities array with zeros.\n",
    "    # Hint: The length of the array should be equal to the number of unique POS tags.\n",
    "    initial_probabilities = np._____(len(___))  # Fill in the blank\n",
    "\n",
    "    # Count the initial tag occurrences\n",
    "    for sentence in tagged_sentences:\n",
    "        # Question 2: Extract the first tag from each sentence.\n",
    "        # Hint: Each sentence is a list of (word, tag) pairs.\n",
    "        initial_tag = ______[0][1]\n",
    "\n",
    "        # Question 3: Increment the count for this tag in the initial_probabilities array.\n",
    "        # Hint: Use 'tag_to_index' to find the index corresponding to the initial_tag.\n",
    "        initial_probabilities[tag_to_index[___]] += 1  # Fill in the blank\n",
    "\n",
    "    # Normalize to get probabilities\n",
    "    # Question 4: Normalize the initial_probabilities array so it sums to 1.\n",
    "    # Hint: Divide each element by the sum of the entire array.\n",
    "    _______ /= _____.sum()\n",
    "\n",
    "    return initial_probabilities"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dd42173-5b34-48ab-86ee-a448120da377",
   "metadata": {},
   "source": [
    "### Problem 2 - Modified Viterbi Algorithm [10 points]\n",
    "\n",
    "- The Viterbi function implements the Viterbi algorithm for POS tagging using a Hidden Markov Model.\n",
    "- Students need to fill in the blanks based on the context of the questions and the hints provided.\n",
    "- The algorithm involves initializing matrices, conducting a forward pass to calculate probabilities, and a backward pass to determine the most likely sequence of states (POS tags).\n",
    "- Understanding each step of the Viterbi algorithm and how it applies to POS tagging is crucial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1b0a40d-b5fb-42c0-bdb9-85dcf500e0d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def Viterbi(y, A, B, Pi, word_to_index, tag_to_index, index_to_tag, unknown_word_prob=1e-6):\n",
    "    # Question 1: Determine the number of states (unique POS tags) in the HMM.\n",
    "    # Hint: The number of states is the number of rows in the transition matrix A.\n",
    "    N = A._____[0]  # Fill in the blank\n",
    "\n",
    "    # Question 2: Determine the length of the observed sequence.\n",
    "    # Hint: The length of the sequence 'y'.\n",
    "    T = ______  # Fill in the blank\n",
    "\n",
    "    # Convert the observed words to their respective indices\n",
    "    y_indices = [word_to_index.get(word, -1) for word in y]\n",
    "\n",
    "    # Initialize matrices C and D\n",
    "    C = np.empty((N, T), 'd')\n",
    "    D = np.empty((N, T), 'B')\n",
    "\n",
    "    # Initialization stage\n",
    "    # Question 3: Initialize the first column of matrix C.\n",
    "    # Hint: Use the initial state distribution Pi and the emission matrix B.\n",
    "    for state in range(N):\n",
    "        obs_index = y_indices[0]\n",
    "        if obs_index >= 0:\n",
    "            C[state, ____] = B[state, obs_index] * ___[state]\n",
    "        else:\n",
    "            C[state, ____] = ___[state] * unknown_word_prob  # Smoothing for unknown words\n",
    "    D[:, 0] = 0\n",
    "\n",
    "    # Forward pass\n",
    "    # Question 4: Implement the forward pass to fill matrix C.\n",
    "    # Hint: Calculate the maximum probability for each state at time 't'.\n",
    "    for t in range(1, T):\n",
    "        for state in range(N):\n",
    "            obs_index = y_indices[t]\n",
    "            if obs_index >= 0:\n",
    "                prob = B[state, obs_index]\n",
    "            else:\n",
    "                prob = unknown_word_prob  # Smoothing for unknown words\n",
    "\n",
    "            C[state, t] = np.____(___)  # Fill in the blank\n",
    "            D[state, t] = np.____(___)  # Fill in the blank\n",
    "\n",
    "    # Correct the indices in matrix D\n",
    "    D[:, 1:] += 1\n",
    "\n",
    "    # Backward pass\n",
    "    # Question 5: Implement the backward pass to find the most probable state sequence.\n",
    "    # Hint: Trace back the path of maximum probabilities.\n",
    "    x = np.empty(T, 'B')\n",
    "    x[-1] = np.____(C[:, T-1]) + 1\n",
    "    for i in reversed(range(1, T)):\n",
    "        x[i-1] = D[x[i]-1, i]\n",
    "\n",
    "    # Convert state indices back to tags\n",
    "    return [index_to_tag[state-1] for state in x], C, D\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "8b3b86f4-066e-4b4c-a682-9f4769bd3342",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Observed Words: ['The', 'dog', 'barked']\n",
      "Tagged Sequence: ['DT', 'NNP', 'SYM']\n"
     ]
    }
   ],
   "source": [
    "# Main execution\n",
    "tagged_sentences = treebank.tagged_sents()\n",
    "tags = set(tag for sent in tagged_sentences for _, tag in sent)\n",
    "tag_to_index = {tag: i for i, tag in enumerate(tags)}\n",
    "index_to_tag = {i: tag for tag, i in tag_to_index.items()}\n",
    "\n",
    "transition_matrix = compute_transition_matrix(tagged_sentences, tag_to_index)\n",
    "emission_matrix, word_to_index = compute_emission_matrix(tagged_sentences, tag_to_index)\n",
    "initial_probabilities = compute_initial_probabilities(tagged_sentences, tag_to_index)\n",
    "\n",
    "# Example observed sequence\n",
    "observed_words = ['The', 'dog', 'barked']\n",
    "observed_indices = [word_to_index[word] if word in word_to_index else -1 for word in observed_words]\n",
    "\n",
    "# Running Viterbi Algorithm\n",
    "tagged_sequence, C, D = Viterbi(observed_words, transition_matrix, emission_matrix, initial_probabilities, word_to_index, tag_to_index, index_to_tag)\n",
    "\n",
    "# Display results\n",
    "print(\"Observed Words:\", observed_words)\n",
    "print(\"Tagged Sequence:\", tagged_sequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "884c3c63-5d1f-4f78-8579-c455a13b1358",
   "metadata": {},
   "source": [
    "### Finding Model Accuracy - 7 Points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "4aa74684-c679-43ff-b611-eef228f85bb1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the dataset into training and testing sets\n",
    "tagged_sentences = treebank.tagged_sents()\n",
    "\n",
    "# Question 1: Calculate the split point for dividing the dataset.\n",
    "# Hint: Typically, the dataset is divided into 80% for training and 20% for testing.\n",
    "# Fill in the blank with the appropriate calculation.\n",
    "split_point = int(len(tagged_sentences) * ___)\n",
    "\n",
    "# Question 2: Split the tagged_sentences into training and testing sets.\n",
    "# Hint: Use slicing based on the split_point.\n",
    "# Fill in the blanks to create the train_set and test_set.\n",
    "train_set = tagged_sentences[:___]\n",
    "test_set = tagged_sentences[___:]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "ad7d9843-0cc5-4df2-9936-2c41544731f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Testing: Predict tags for each sentence in the test set and calculate accuracy\n",
    "correct_tags = 0\n",
    "total_tags = 0\n",
    "\n",
    "for sentence in test_set:\n",
    "    # Question 1: Extract the words from each sentence.\n",
    "    # Hint: Each sentence is a list of (word, tag) tuples.\n",
    "    words = [___ for word, _ in sentence]  # Fill in the blank\n",
    "\n",
    "    # Question 2: Extract the actual tags from each sentence.\n",
    "    # Hint: You need the tags to compare with your model's predictions.\n",
    "    actual_tags = [___ for _, tag in sentence]  # Fill in the blank\n",
    "\n",
    "    # Use the Viterbi algorithm to predict tags\n",
    "    # Assuming 'Viterbi' is your function to predict tags using the HMM model\n",
    "    predicted_tags, _, _ = Viterbi(words, transition_matrix, emission_matrix, initial_probabilities, word_to_index, tag_to_index, index_to_tag)\n",
    "\n",
    "    # Question 3: Compare predicted tags with actual tags and count correct predictions.\n",
    "    # Hint: Increment 'correct_tags' if a predicted tag matches the actual tag.\n",
    "    correct_tags += sum(pred == actual for pred, actual in ___)  # Fill in the blank\n",
    "    total_tags += len(actual_tags)\n",
    "\n",
    "# Question 4: Calculate the accuracy of the model.\n",
    "# Hint: Accuracy is the ratio of correct predictions to total predictions.\n",
    "accuracy = _____ / _____\n",
    "print(\"Accuracy of the POS tagging model:\", accuracy)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "8af7a8d3-aa7d-4fee-8e7f-59b604c4765c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of the POS tagging model: 0.9785418434053595\n"
     ]
    }
   ],
   "source": [
    "# Calculate accuracy\n",
    "accuracy = correct_tags / total_tags\n",
    "print(\"Accuracy of the POS tagging model:\", accuracy)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}