comp6714 project

𝐍𝐨𝐭𝐞: It will take you quite some time to complete this project, therefore, we earnestly recommend that you start working as early as possible. You should read the specs carefully at least 2-3 times before you start coding.

Submission deadline for the Project is 20:59:59 (08:59:59 PM) on 25th Nov, 2020

LATE PENALTY: -10% on day-1, -20% on day-2 and -70% on day-3 (i.e., 0 mark after 3 days late)

Updates

Important Update on 2020-10-29:

Due to precision problem in float numbers (see the example below), we will modify our inverted index implementation such that the normalized tf-idf weights, 𝑤𝑡,𝑑, are integers.

Note that this does not require any change in your implementation of the WAND algorithm.

0.3333 + 0.6667 + 0.6667

0.6667 + 0.6667 + 0.3333

1.6666999999999998

(0.3333 + 0.6667 + 0.6667) == (0.6667 + 0.6667 + 0.3333)

False

Clarification on ties.

When two documents have the same score, we break the tie by retaining the document with the smaller document ID.

Project Specification

Instructions

This note book contains instructions for 𝐂𝐎𝐌𝐏6714-𝐏𝐫𝐨𝐣𝐞𝐜𝐭.

You are required to complete your implementation for part-1 in a file project_part1.py provided along with this notebook. Please 𝐃𝐎 𝐍𝐎𝐓 𝐀𝐋𝐓𝐄𝐑 the name of the file.

You are required to complete a report for part-2 in a file project_part2.pdf.

You are not allowed to print out unnecessary stuff. We will not consider any output printed out on the screen. All results should be returned in appropriate data structures via corresponding functions.

You can submit your implementation for Project (Part-1) and a report for Project (Part-2) via submission system: http://kg.cse.unsw.edu.au/submit/ . We have already sent out the invitations for you to join the submission system. In case of problems please post your request @ Piazza.

For each question, we have provided you with detailed instructions along with question headings. In case of problems, you can post your query @ Piazza.

You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. Only functions are allowed in project_part1.py

You should not import unnecessary and non-standard modules/libraries. Loading such libraries at test time will lead to errors and hence 0 mark for your project. If you are not sure, please ask @ Piazza.

We will provide immediate feedback on your submission. You can access your scores using the online submission portal on the same day.

For the Final Evaluation, we will be using a different dataset, so your final scores may vary.

You are allowed to have a limited number of Feedback Attempts (15 𝐀𝐭𝐭𝐞𝐦𝐩𝐭𝐬 𝐟𝐨𝐫 𝐞𝐚𝐜𝐡 𝐬𝐭𝐮𝐝𝐞𝐧𝐭), we will use your LAST submission for Final Evaluation.

Allowed Libraries:

You are required to write your implementation for the project (part-1) using Python 3.6.5. You are allowed to use any python standard libraries(https://docs.python.org/3.6/library/).

Part One (80 Points)

Introduction

In this project, you are required to implement WAND algorithm. We use the variant version in 2013 ADCS paper [Exploring the Magic of WAND] Algorithm 1 WAND processing. We will provide a inverted indexing model which in Inv_Index.py Please do not change this file.

Inputs:

Input to InvertedIndex model is as follow:

Documents (𝐷) as a dictionary with 𝑘𝑒𝑦: doc_id; 𝑣𝑎𝑙𝑢𝑒: document text

During query, your WAND algorithm are received as follows:

query_terms as a list with 𝑡𝑒𝑟𝑚: a string list

top_k as a Integer :𝑡𝑜𝑝_𝑘≥1, the number of documents you need to return

inverted_index as a dictionary with 𝑘𝑒𝑦: term; 𝑣𝑎𝑙𝑢𝑒: posting list, each element in posting list is a tuple (𝑑𝑜𝑐_𝑖𝑑,𝑤𝑡,𝑑)

Toy Example for Illustration

Here, we provide a small toy example for illustration of Inverted Index:

Let the dictionary of documents (𝐷) be:

documents = {1:'President Trump was on his way to new New York in New York City.',

2:'New York Times mentioned an interesting story about Trump.',

3:'I think it would be great if I can travel to New York this summer to see Trump.'}

We will use the provided inverted indexing model to get the inverted index. We do not use any preprocessing method(e.g. remove punctuation, stemming, case folding) and only use simple split function in python to get terms for each document.

Thus, the term list of doc_id = 1 would be:

['President', 'Trump', 'was', 'on', 'his', 'way', 'to', 'new', 'New', 'York', 'in', 'New', 'York', 'City.']

We will calcuate normalized tf-idf as 𝑤𝑡,𝑑, the formula would be:

𝑤𝑡,𝑑=⌈(1+ln(1+ln(𝑡𝑓𝑡,𝑑)))⌉⋅⌈(1+ln(|𝑑|1+𝑑𝑓𝑡))⌉

You can use InvertedIndex(documents).get_inverted_index() to get inverted index:

inverted_index = InvertedIndex(documents).get_inverted_index()

The inverted_index of term 𝑁𝑒𝑤 for 𝐷 would be:

'New': [(1, 2), (2, 1), (3, 1)]

Let the query_terms (𝑄) be:

Q = ["President","New","York"]

Output Format:

Your output should be two elements:

topk_result a list of the form (score, doc_id), where

score a float: corresponds to the sum of TF-IDF score among all the term based on the intersection of 𝑄 and 𝐷.

doc_id a integer: is document unique id.

the list should be sorted by score in decrease order, if two documents have same score, smaller document id precedes larger one.

full_evaluation_count a integer: the number of documents fully evaluated in WAND algorithm.

Notice: During WAND processing, if multiple documents in top-k with same scores and we need to remove one of them, we remove the one with largest document ID.

Running Time:

On CSE machines, your implementation for WAND algorithm 𝐒𝐇𝐎𝐔𝐋𝐃 𝐍𝐎𝐓 take more than 120 seconds in our submit system.

In our testcases, the number of documents(with average around 400 terms) would not be greater than 3000, the number of terms in 𝑄 would not be greater than 20, and k would not be greater than 100.

How we test implementation of part one

import pickle

from Inv_Index import InvertedIndex

from project_part1 import WAND_Algo

fname = './Data/sample_documents.pickle'

documents = pickle.load(open(fname,"rb"))

documents

{1: 'President Trump was on his way to new New York in New York City.',

2: 'New York Times mentioned an interesting story about Trump.',

3: 'I think it would be great if I can travel to New York this summer to see Trump.'}

## 1. Construct and get inverted_index

inverted_index = InvertedIndex(documents).get_inverted_index()

## Test cases

query_terms = ["President","New","York"]

top_k = 2

## 2. WAND algorithm...

topk_result, full_evaluation_count = WAND_Algo(query_terms, top_k, inverted_index)

print('Top-k result = ', topk_result)

print('Evaluation Count = ', full_evaluation_count)

Top-k result = [(6, 1), (2, 2)]

Evaluation Count = 3

Part Two (20 Points)

Introduction

Threshold 𝛼 is a very important parameter in WAND algorithm. If we can set a suitable value as initial threshold, we can reduce the number of fully evaluated documents. However, if we set a very high value, we may get wrong top-k result.

Notice: we only consider the WAND algorithm in lecture slides(i.e., Algorithm 1 described in the ADCS 2013 paper).

With an important modification: Use the following 6 lines of code to replace Lines 28 to 34 of the ADCS2013 algorithm.

This modification is only used in part two.

Assumptions

Assume that there is no two documents who achieve the same score for the query 𝑄.

Assume that we already had top-k answer for 𝑄.

Task

In this part, you need to complete a report project_part2.pdf answering the followoing questions:

Find out the the upper bound (𝛼𝑚𝑎𝑥) of threshold 𝛼 such that

∀𝛼<𝛼𝑚𝑎𝑥, the algorithm (NB: the modified version of ADCS 2013 algorithm) will always return the correct top-k answers for 𝑄;

∀𝛼≥𝛼𝑚𝑎𝑥, the algorithm may return the wrong top-k answers for 𝑄 in some cases.

Prove that your answer is correct in a rigorous way.

Note: if we set 𝛼=min{𝑠𝑐𝑜𝑟𝑒(𝐷;𝑄)∣𝐷∈top-k(𝑄)}, the algorithm will always return the correct top-k answers. However, we are sessking a higher value for 𝛼 here.

Project Submission and Feedback

For project submission, you are required to submit the following files via submission system: http://kg.cse.unsw.edu.au/submit/:

A python file project_part1.py.

A report project_part2.pdf.

Note: Every student will be entitled to 15 Feedback Attempts (use them wisely), we will use the last submission for final evaluation.

QQ：99515681
WeChat：codinghelp
Email：99515681@qq.com
Work Time：8:00-23:00

Hots

Help With Dts206tc Applied Linear Stat... 2024-04-17
Ghostwriter Apstat.Ge.2110 Ap Plied St... 2024-04-17
Ghostwriter Econ 1103 002 &Amp; 013 – 2024-04-17
Ghostwriter Busness 114 – Accounting F 2024-04-17
Help With Rest0004: Property Investmen... 2024-04-17
Help With Swen30006 Software Modelling... 2024-04-17
Help With Interconnected Worlds: Explo... 2024-04-17
Help With Problem Set #1 – Econ 121A, 2024-04-17
Help With Ad 685 Term Projectghostwrit... 2024-04-17
Ghostwriter Math 194: Problem Set #2 S... 2024-04-17
Ghostwriter Ceg 5301 Assignment 5 Ay 2... 2024-04-17
Ghostwriter Chin0085 Final Exam Questi... 2024-04-17
Ghostwriter Mth 214 Life Insurance Mat... 2024-04-17
Ghostwriter Comm1110 Evidence-Based Pr... 2024-04-17
Help With Cen103 – Solids And Structur 2024-04-17
Ghostwriter Stats 320 – Applied Stocha 2024-04-17
Ghostwriter Ac1103 – Accounting I Seme 2024-04-17
Help With Fit3175 Usability - S1 2024H... 2024-04-17
Ghostwriter Bx2031 Sp51 2024 Investmen... 2024-04-17
Help With Cs 20A: Data Structures With... 2024-04-17

Programming Assignment Help！