Home Page > > Details

๐๐จ๐ญ๐: It will take you quite some time to complete this project, therefore, we earnestly recommend that you start working as early as possible. You should read the specs carefully at least 2-3 times before you start coding.

Submission deadline for the Project is 20:59:59 (08:59:59 PM) on 25th Nov, 2020

LATE PENALTY: -10% on day-1, -20% on day-2 and -70% on day-3 (i.e., 0 mark after 3 days late)

Updates

Important Update on 2020-10-29:

Due to precision problem in float numbers (see the example below), we will modify our inverted index implementation such that the normalized tf-idf weights, ๐ค๐ก,๐, are integers.

Note that this does not require any change in your implementation of the WAND algorithm.

0.3333 + 0.6667 + 0.6667

0.6667 + 0.6667 + 0.3333

1.6666999999999998

(0.3333 + 0.6667 + 0.6667) == (0.6667 + 0.6667 + 0.3333)

False

Clarification on ties.

When two documents have the same score, we break the tie by retaining the document with the smaller document ID.

Project Specification

Instructions

This note book contains instructions for ๐๐๐๐6714-๐๐ซ๐จ๐ฃ๐๐๐ญ.

You are required to complete your implementation for part-1 in a file project_part1.py provided along with this notebook. Please ๐๐ ๐๐๐ ๐๐๐๐๐ the name of the file.

You are required to complete a report for part-2 in a file project_part2.pdf.

You are not allowed to print out unnecessary stuff. We will not consider any output printed out on the screen. All results should be returned in appropriate data structures via corresponding functions.

You can submit your implementation for Project (Part-1) and a report for Project (Part-2) via submission system: http://kg.cse.unsw.edu.au/submit/ . We have already sent out the invitations for you to join the submission system. In case of problems please post your request @ Piazza.

For each question, we have provided you with detailed instructions along with question headings. In case of problems, you can post your query @ Piazza.

You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. Only functions are allowed in project_part1.py

You should not import unnecessary and non-standard modules/libraries. Loading such libraries at test time will lead to errors and hence 0 mark for your project. If you are not sure, please ask @ Piazza.

We will provide immediate feedback on your submission. You can access your scores using the online submission portal on the same day.

For the Final Evaluation, we will be using a different dataset, so your final scores may vary.

You are allowed to have a limited number of Feedback Attempts (15 ๐๐ญ๐ญ๐๐ฆ๐ฉ๐ญ๐ฌ ๐๐จ๐ซ ๐๐๐๐ก ๐ฌ๐ญ๐ฎ๐๐๐ง๐ญ), we will use your LAST submission for Final Evaluation.

Allowed Libraries:

You are required to write your implementation for the project (part-1) using Python 3.6.5. You are allowed to use any python standard libraries(https://docs.python.org/3.6/library/).

Part One (80 Points)

Introduction

In this project, you are required to implement WAND algorithm. We use the variant version in 2013 ADCS paper [Exploring the Magic of WAND] Algorithm 1 WAND processing. We will provide a inverted indexing model which in Inv_Index.py Please do not change this file.

Inputs:

Input to InvertedIndex model is as follow:

Documents (๐ท) as a dictionary with ๐๐๐ฆ: doc_id; ๐ฃ๐๐๐ข๐: document text

During query, your WAND algorithm are received as follows:

query_terms as a list with ๐ก๐๐๐: a string list

top_k as a Integer :๐ก๐๐_๐≥1, the number of documents you need to return

inverted_index as a dictionary with ๐๐๐ฆ: term; ๐ฃ๐๐๐ข๐: posting list, each element in posting list is a tuple (๐๐๐_๐๐,๐ค๐ก,๐)

Toy Example for Illustration

Here, we provide a small toy example for illustration of Inverted Index:

Let the dictionary of documents (๐ท) be:

documents = {1:'President Trump was on his way to new New York in New York City.',

2:'New York Times mentioned an interesting story about Trump.',

3:'I think it would be great if I can travel to New York this summer to see Trump.'}

We will use the provided inverted indexing model to get the inverted index. We do not use any preprocessing method(e.g. remove punctuation, stemming, case folding) and only use simple split function in python to get terms for each document.

Thus, the term list of doc_id = 1 would be:

['President', 'Trump', 'was', 'on', 'his', 'way', 'to', 'new', 'New', 'York', 'in', 'New', 'York', 'City.']

We will calcuate normalized tf-idf as ๐ค๐ก,๐, the formula would be:

๐ค๐ก,๐=⌈(1+ln(1+ln(๐ก๐๐ก,๐)))⌉⋅⌈(1+ln(|๐|1+๐๐๐ก))⌉

You can use InvertedIndex(documents).get_inverted_index() to get inverted index:

inverted_index = InvertedIndex(documents).get_inverted_index()

The inverted_index of term ๐๐๐ค for ๐ท would be:

'New': [(1, 2), (2, 1), (3, 1)]

Let the query_terms (๐) be:

Q = ["President","New","York"]

Output Format:

Your output should be two elements:

topk_result a list of the form (score, doc_id), where

score a float: corresponds to the sum of TF-IDF score among all the term based on the intersection of ๐ and ๐ท.

doc_id a integer: is document unique id.

the list should be sorted by score in decrease order, if two documents have same score, smaller document id precedes larger one.

full_evaluation_count a integer: the number of documents fully evaluated in WAND algorithm.

Notice: During WAND processing, if multiple documents in top-k with same scores and we need to remove one of them, we remove the one with largest document ID.

Running Time:

On CSE machines, your implementation for WAND algorithm ๐๐๐๐๐๐ ๐๐๐ take more than 120 seconds in our submit system.

In our testcases, the number of documents(with average around 400 terms) would not be greater than 3000, the number of terms in ๐ would not be greater than 20, and k would not be greater than 100.

How we test implementation of part one

import pickle

from Inv_Index import InvertedIndex

from project_part1 import WAND_Algo

fname = './Data/sample_documents.pickle'

documents = pickle.load(open(fname,"rb"))

โ

documents

{1: 'President Trump was on his way to new New York in New York City.',

2: 'New York Times mentioned an interesting story about Trump.',

3: 'I think it would be great if I can travel to New York this summer to see Trump.'}

## 1. Construct and get inverted_index

inverted_index = InvertedIndex(documents).get_inverted_index()

## Test cases

query_terms = ["President","New","York"]

top_k = 2

โ

## 2. WAND algorithm...

topk_result, full_evaluation_count = WAND_Algo(query_terms, top_k, inverted_index)

โ

print('Top-k result = ', topk_result)

print('Evaluation Count = ', full_evaluation_count)

Top-k result = [(6, 1), (2, 2)]

Evaluation Count = 3

Part Two (20 Points)

Introduction

Threshold ๐ผ is a very important parameter in WAND algorithm. If we can set a suitable value as initial threshold, we can reduce the number of fully evaluated documents. However, if we set a very high value, we may get wrong top-k result.

Notice: we only consider the WAND algorithm in lecture slides(i.e., Algorithm 1 described in the ADCS 2013 paper).

With an important modification: Use the following 6 lines of code to replace Lines 28 to 34 of the ADCS2013 algorithm.

This modification is only used in part two.

Assumptions

Assume that there is no two documents who achieve the same score for the query ๐.

Assume that we already had top-k answer for ๐.

Task

In this part, you need to complete a report project_part2.pdf answering the followoing questions:

Find out the the upper bound (๐ผ๐๐๐ฅ) of threshold ๐ผ such that

∀๐ผ<๐ผ๐๐๐ฅ, the algorithm (NB: the modified version of ADCS 2013 algorithm) will always return the correct top-k answers for ๐;

∀๐ผ≥๐ผ๐๐๐ฅ, the algorithm may return the wrong top-k answers for ๐ in some cases.

Prove that your answer is correct in a rigorous way.

Note: if we set ๐ผ=min{๐ ๐๐๐๐(๐ท;๐)โฃ๐ท∈top-k(๐)}, the algorithm will always return the correct top-k answers. However, we are sessking a higher value for ๐ผ here.

Project Submission and Feedback

For project submission, you are required to submit the following files via submission system: http://kg.cse.unsw.edu.au/submit/:

A python file project_part1.py.

A report project_part2.pdf.

Note: Every student will be entitled to 15 Feedback Attempts (use them wisely), we will use the last submission for final evaluation.

Contact Us(Ghostwriter Service)

- QQ๏ผ99515681
- WeChat๏ผcodinghelp2
- Email๏ผ99515681@qq.com
- Work Time๏ผ8:00-23:00

- Cs2461-10 Programminghelp With ,Ghostw... 2021-03-02
- Ghostwriter Program Programming,Help W... 2021-03-02
- Programming Coursehelp With ,Ghostwrit... 2021-03-02
- Ghostwriter Csc1-Ua Programming,Help W... 2021-03-02
- Help With Program Programming,Ghostwri... 2021-03-02
- Ghostwriter Data Programming,Help With... 2021-03-02
- Cse 13S Programminghelp With ,Ghostwri... 2021-03-02
- Mat136h5 Programminghelp With ,C/C++ P... 2021-03-01
- Ghostwriter Ee425x Programming,Help Wi... 2021-03-01
- Cscc11 Programming Coursehelp With ,Gh... 2021-03-01
- Ghostwriter Program Programming,Python... 2021-03-01
- Help With R Programming|Help With Data... 2021-03-01
- Data Structuresghostwriter ,Help With ... 2021-03-01
- Help With Data Programming,C++๏ผPython... 2021-03-01
- Ghostwriter Aps 105 Programming,C/C++ ... 2021-03-01
- Fre6831 Computational Finance 2021-02-28
- Sta141b Assignment 5 Interactive Visu... 2021-02-28
- Eecs2011a-F20 2021-02-28
- Comp-251 Final Asssessment 2021-02-28
- Ghostwriter Cs1027 Course Programming,... 2021-02-28

Contact Us - Email๏ผ99515681@qq.com WeChat๏ผcodinghelp

ยฉ 2014 www.asgnhelp.com

Programming Assignment Help๏ผ