Home Page > > Details

comp6714 project

 ๐๐จ๐ญ๐ž: It will take you quite some time to complete this project, therefore, we earnestly recommend that you start working as early as possible. You should read the specs carefully at least 2-3 times before you start coding.

 
Submission deadline for the Project is 20:59:59 (08:59:59 PM) on 25th Nov, 2020
LATE PENALTY: -10% on day-1, -20% on day-2 and -70% on day-3 (i.e., 0 mark after 3 days late)
 
Updates
Important Update on 2020-10-29:
 
Due to precision problem in float numbers (see the example below), we will modify our inverted index implementation such that the normalized tf-idf weights, ๐‘ค๐‘ก,๐‘‘, are integers.
 
Note that this does not require any change in your implementation of the WAND algorithm.
 
0.3333 + 0.6667 + 0.6667
0.6667 + 0.6667 + 0.3333
1.6666999999999998
(0.3333 + 0.6667 + 0.6667) == (0.6667 + 0.6667 + 0.3333)
False
Clarification on ties.
 
When two documents have the same score, we break the tie by retaining the document with the smaller document ID.
 
Project Specification
Instructions
This note book contains instructions for ๐‚๐Ž๐Œ๐6714-๐๐ซ๐จ๐ฃ๐ž๐œ๐ญ.
 
You are required to complete your implementation for part-1 in a file project_part1.py provided along with this notebook. Please ๐ƒ๐Ž ๐๐Ž๐“ ๐€๐‹๐“๐„๐‘ the name of the file.
 
You are required to complete a report for part-2 in a file project_part2.pdf.
 
You are not allowed to print out unnecessary stuff. We will not consider any output printed out on the screen. All results should be returned in appropriate data structures via corresponding functions.
 
You can submit your implementation for Project (Part-1) and a report for Project (Part-2) via submission system: http://kg.cse.unsw.edu.au/submit/ . We have already sent out the invitations for you to join the submission system. In case of problems please post your request @ Piazza.
 
For each question, we have provided you with detailed instructions along with question headings. In case of problems, you can post your query @ Piazza.
 
You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. Only functions are allowed in project_part1.py
 
You should not import unnecessary and non-standard modules/libraries. Loading such libraries at test time will lead to errors and hence 0 mark for your project. If you are not sure, please ask @ Piazza.
 
We will provide immediate feedback on your submission. You can access your scores using the online submission portal on the same day.
 
For the Final Evaluation, we will be using a different dataset, so your final scores may vary.
 
You are allowed to have a limited number of Feedback Attempts (15 ๐€๐ญ๐ญ๐ž๐ฆ๐ฉ๐ญ๐ฌ ๐Ÿ๐จ๐ซ ๐ž๐š๐œ๐ก ๐ฌ๐ญ๐ฎ๐๐ž๐ง๐ญ), we will use your LAST submission for Final Evaluation.
 
Allowed Libraries:
You are required to write your implementation for the project (part-1) using Python 3.6.5. You are allowed to use any python standard libraries(https://docs.python.org/3.6/library/).
 
Part One (80 Points)
Introduction
In this project, you are required to implement WAND algorithm. We use the variant version in 2013 ADCS paper [Exploring the Magic of WAND] Algorithm 1 WAND processing. We will provide a inverted indexing model which in Inv_Index.py Please do not change this file.
 
Inputs:
Input to InvertedIndex model is as follow:
 
Documents (๐ท) as a dictionary with ๐‘˜๐‘’๐‘ฆ: doc_id; ๐‘ฃ๐‘Ž๐‘™๐‘ข๐‘’: document text
During query, your WAND algorithm are received as follows:
 
query_terms as a list with ๐‘ก๐‘’๐‘Ÿ๐‘š: a string list
top_k as a Integer :๐‘ก๐‘œ๐‘_๐‘˜≥1, the number of documents you need to return
inverted_index as a dictionary with ๐‘˜๐‘’๐‘ฆ: term; ๐‘ฃ๐‘Ž๐‘™๐‘ข๐‘’: posting list, each element in posting list is a tuple (๐‘‘๐‘œ๐‘_๐‘–๐‘‘,๐‘ค๐‘ก,๐‘‘)
Toy Example for Illustration
Here, we provide a small toy example for illustration of Inverted Index:
Let the dictionary of documents (๐ท) be:
 
documents = {1:'President Trump was on his way to new New York in New York City.',
             2:'New York Times mentioned an interesting story about Trump.',
             3:'I think it would be great if I can travel to New York this summer to see Trump.'}
We will use the provided inverted indexing model to get the inverted index. We do not use any preprocessing method(e.g. remove punctuation, stemming, case folding) and only use simple split function in python to get terms for each document.
Thus, the term list of doc_id = 1 would be:
 
['President', 'Trump', 'was', 'on', 'his', 'way', 'to', 'new', 'New', 'York', 'in', 'New', 'York', 'City.']
We will calcuate normalized tf-idf as ๐‘ค๐‘ก,๐‘‘, the formula would be:
 
๐‘ค๐‘ก,๐‘‘=⌈(1+ln(1+ln(๐‘ก๐‘“๐‘ก,๐‘‘)))⌉⋅⌈(1+ln(|๐‘‘|1+๐‘‘๐‘“๐‘ก))⌉
You can use InvertedIndex(documents).get_inverted_index() to get inverted index:
 
inverted_index = InvertedIndex(documents).get_inverted_index()
The inverted_index of term ๐‘๐‘’๐‘ค for ๐ท would be:
 
'New': [(1, 2), (2, 1), (3, 1)]
Let the query_terms (๐‘„) be:
 
Q = ["President","New","York"]
Output Format:
Your output should be two elements:
topk_result a list of the form (score, doc_id), where
 
score a float: corresponds to the sum of TF-IDF score among all the term based on the intersection of ๐‘„ and ๐ท.
doc_id a integer: is document unique id.
the list should be sorted by score in decrease order, if two documents have same score, smaller document id precedes larger one.
full_evaluation_count a integer: the number of documents fully evaluated in WAND algorithm.
Notice: During WAND processing, if multiple documents in top-k with same scores and we need to remove one of them, we remove the one with largest document ID.
 
Running Time:
On CSE machines, your implementation for WAND algorithm ๐’๐‡๐Ž๐”๐‹๐ƒ ๐๐Ž๐“ take more than 120 seconds in our submit system.
In our testcases, the number of documents(with average around 400 terms) would not be greater than 3000, the number of terms in ๐‘„ would not be greater than 20, and k would not be greater than 100.
How we test implementation of part one
import pickle
from Inv_Index import InvertedIndex
from project_part1 import WAND_Algo
fname = './Data/sample_documents.pickle'
documents = pickle.load(open(fname,"rb"))
โ€‹
documents
{1: 'President Trump was on his way to new New York in New York City.',
 2: 'New York Times mentioned an interesting story about Trump.',
 3: 'I think it would be great if I can travel to New York this summer to see Trump.'}
## 1. Construct and get inverted_index
inverted_index = InvertedIndex(documents).get_inverted_index()
## Test cases
query_terms = ["President","New","York"]
top_k = 2
โ€‹
## 2. WAND algorithm...
topk_result, full_evaluation_count = WAND_Algo(query_terms, top_k, inverted_index)
โ€‹
print('Top-k result = ', topk_result)
print('Evaluation Count = ', full_evaluation_count)
Top-k result =  [(6, 1), (2, 2)]
Evaluation Count =  3
Part Two (20 Points)
Introduction
Threshold ๐›ผ is a very important parameter in WAND algorithm. If we can set a suitable value as initial threshold, we can reduce the number of fully evaluated documents. However, if we set a very high value, we may get wrong top-k result.
 
Notice: we only consider the WAND algorithm in lecture slides(i.e., Algorithm 1 described in the ADCS 2013 paper).
With an important modification: Use the following 6 lines of code to replace Lines 28 to 34 of the ADCS2013 algorithm.
 
 
 
This modification is only used in part two.
 
Assumptions
Assume that there is no two documents who achieve the same score for the query ๐‘„.
Assume that we already had top-k answer for ๐‘„.
Task
In this part, you need to complete a report project_part2.pdf answering the followoing questions:
 
Find out the the upper bound (๐›ผ๐‘š๐‘Ž๐‘ฅ) of threshold ๐›ผ such that
∀๐›ผ<๐›ผ๐‘š๐‘Ž๐‘ฅ, the algorithm (NB: the modified version of ADCS 2013 algorithm) will always return the correct top-k answers for ๐‘„;
∀๐›ผ≥๐›ผ๐‘š๐‘Ž๐‘ฅ, the algorithm may return the wrong top-k answers for ๐‘„ in some cases.
Prove that your answer is correct in a rigorous way.
Note: if we set ๐›ผ=min{๐‘ ๐‘๐‘œ๐‘Ÿ๐‘’(๐ท;๐‘„)โˆฃ๐ท∈top-k(๐‘„)}, the algorithm will always return the correct top-k answers. However, we are sessking a higher value for ๐›ผ here.
 
Project Submission and Feedback
For project submission, you are required to submit the following files via submission system: http://kg.cse.unsw.edu.au/submit/:
 
A python file project_part1.py.
A report project_part2.pdf.
Note: Every student will be entitled to 15 Feedback Attempts (use them wisely), we will use the last submission for final evaluation.
Contact Us - Email๏ผš99515681@qq.com    WeChat๏ผšcodinghelp
ยฉ 2014 www.asgnhelp.com
Programming Assignment Help๏ผ