
COMP6714-Project (Part-1)

Instructions
This notebook contains instructions for COMP6714-Project (Part-1). We will release the instructions for Part-2 of the Project in a separate notebook.
You are required to complete your implementation for part-1 in a file project_part1.py provided along with this notebook. Please DO NOT ALTER the name of the file.
 
You are not allowed to print out unnecessary stuff. We will not consider any output printed out on the screen. All results should be returned in appropriate data structures via corresponding functions.
 
You can submit your implementation for Project (Part-1) via the submission system: http://kg.cse.unsw.edu.au/submit/ . We have already sent out the invitations for you to join the submission system. In case of problems, please post your request @ Piazza.
 
For each question, we have provided you with detailed instructions along with question headings. In case of problems, you can post your query @ Piazza.
 
You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. Only functions are allowed in project_part1.py
 
You should not import unnecessary and non-standard modules/libraries. Loading such libraries at test time will lead to errors and hence 0 mark for your project. If you are not sure, please ask @ Piazza.
 
We will provide immediate feedback on your submission. You can access your scores using the online submission portal on the same day.
 
For the Final Evaluation, we will be using a different dataset, so your final scores may vary.
 
You are allowed to have a limited number of Feedback Attempts (15 Attempts for each student); we will use your LAST submission for Final Evaluation.
 
Allowed Libraries:
You are required to write your implementation for the project (part-1) using Python 3.6.5. You are only allowed to use the following Python libraries:
 
spacy (v2.1.8)
Q1: Compute TF-IDF score for query (80 Points)
Introduction
In this project, you are required to compute the TF-IDF score of a document $D_j$ w.r.t. an input query Q and a Dictionary of Entities (DoE).
 
Inputs (Q1):
Inputs to your model are as follows:
 
Documents (D) as a dictionary with key: doc_id; value: document text
Query (Q), as a string of words
Dictionary of Entities (DoE), with key: entity; value: entity_id
The procedure for computation of the TF-IDF score follows these steps:
 
TF-IDF index construction for Entities and Tokens
Split the query into lists of Entities and Tokens
Query Score Computation
A detailed description of these steps is as follows:
 
1. TF-IDF index construction for Entities and Tokens
We require you to separately learn a TF-IDF index for tokens ($TF\text{-}IDF_{token}$) and for entities ($TF\text{-}IDF_{entity}$). The computation of each TF-IDF index is given as follows:
 
TF-IDF index For Tokens:
The term frequency of token $t$ in a document $D_j$ is computed as follows:

$$TF_{token}(t, D_j) = \#\text{ of times token } t \text{ appears in } D_j$$

To de-emphasize high token frequencies, we apply a double log to normalize the term frequency. The normalized term frequency of token $t$ is computed as follows:

$$TF_{norm\_token}(t, D_j) = 1.0 + \ln(1.0 + \ln(TF_{token}(t, D_j)))$$

And, the Inverse Document Frequency of token $t$ is computed as follows:

$$IDF_{token}(t) = 1.0 + \ln\left(\frac{\text{total } \#\text{ of docs}}{1.0 + \#\text{ of docs containing token } t}\right)$$

The TF-IDF score of token $t$ in document $D_j$ is computed as:

$$TF\text{-}IDF_{token}(t, D_j) = TF_{norm\_token}(t, D_j) * IDF_{token}(t)$$
TF-IDF index for Entities:
The term frequency of entity $e$ in a document $D_j$ is computed as follows:

$$TF_{entity}(e, D_j) = \#\text{ of times entity } e \text{ appears in } D_j$$

We simply use the natural log to normalize the term frequency of the entities, as given below:

$$TF_{norm\_entity}(e, D_j) = 1.0 + \ln(TF_{entity}(e, D_j))$$

And, the Inverse Document Frequency of entity $e$ is computed as follows:

$$IDF_{entity}(e) = 1.0 + \ln\left(\frac{\text{total } \#\text{ of docs}}{1.0 + \#\text{ of docs containing entity } e}\right)$$

The TF-IDF score of entity $e$ in document $D_j$ is computed as:

$$TF\text{-}IDF_{entity}(e, D_j) = TF_{norm\_entity}(e, D_j) * IDF_{entity}(e)$$
Note: We assign TF-IDF score = 0.0 for the cases where the term frequency (TF) for the token and/or entity is ZERO.
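 
The following is a minimal, illustrative Python sketch of the two formulas above, including the zero-TF rule. The helper names tf_idf_token and tf_idf_entity and the argument layout are assumptions for illustration only; they are not part of the required project interface.

import math

def tf_idf_token(tf, total_docs, docs_with_token):
    # Token TF-IDF with double-log normalization; returns 0.0 when TF is zero.
    if tf == 0:
        return 0.0
    tf_norm = 1.0 + math.log(1.0 + math.log(tf))
    idf = 1.0 + math.log(total_docs / (1.0 + docs_with_token))
    return tf_norm * idf

def tf_idf_entity(tf, total_docs, docs_with_entity):
    # Entity TF-IDF with single-log normalization; returns 0.0 when TF is zero.
    if tf == 0:
        return 0.0
    tf_norm = 1.0 + math.log(tf)
    idf = 1.0 + math.log(total_docs / (1.0 + docs_with_entity))
    return tf_norm * idf

For instance, a token that appears once in a document and occurs in one of three documents would score tf_idf_token(1, 3, 1) = 1.0 * (1.0 + ln(1.5)) ≈ 1.4055.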
 
2. Split the Query into Entities and Tokens:
At first, you are required to split the query (Q) into all possible combinations of free keywords, i.e., tokens ($K = \{k_i\}_{i=1}^{N}$) and entities ($E = \{e_i\}_{i=1}^{N}$), where entities correspond to a subset of entities found in DoE formed by individual and/or combinations of tokens in Q. This process is explained below:
 
Step 1: We look for probable entities in Q by considering individual and/or combinations of query tokens, formed by combining the tokens in the increasing order of the query string. Amongst them, we only select the entities present in DoE.
Step 2: Based on the selected list of entities found in Step-1 enumerate all possible subsets of entities.
Step 3: Filter subsets of entities found in Step-2 such that for each subset the token count does not exceed the corresponding token count in Q. We treat the filtered subset as the final entities of the corresponding query split.
Step 4: For each filtered entity subset, the rest of the keywords in the query, i.e., (Q∖wordsInEntities(ei)) are treated as the tokens of the query split.
 
Formally, let the query be a string of tokens, e.g., Q="ABCDEFG", and the dictionary of entities be DoE={AB,DF,GK}. The list of entities formed by the tokens in the query and/or combinations of query tokens (contained in DoE) is [AB,DF], and upon enumerating the possible subsets of these entities, we get the following possible splits of the query into lists of entities and tokens:
 
Split-1: e1=[]; k1=[A,B,C,D,E,F,G]
Split-2: e2=[AB]; k2=[C,D,E,F,G]
Split-3: e3=[DF]; k3=[A,B,C,E,G]
Split-4: e4=[AB,DF]; k4=[C,E,G]
Note: 
 
In order to split the query, we only care about the subset of entities contained in DoE that can be formed by individual and/or combination of tokens in the Q.
Entities in DoE may correspond to single and/or multiple tokens, e.g., in the example given above A, ABC etc., may also correspond to valid entities and may appear in the DoE.
 
The maximum number of query splits is upper-bounded by the number of subsets of entities in DoE that can be formed by the tokens in Q.
 
For every query split, the leftover keywords Q∖wordsInEntities(ei) are considered as the corresponding token split.
 
In order to form entities, we only combine keywords in the increasing order of the elements in the query string. For example, in Q="ABCDEFG", the entities such as: BA, CB etc., will not be considered as entities and hence will not appear in the DoE.
 
In the example given above, if DoE = {AB, BC}, then there will be only three possible splits of the query. Because Q contains only one instance of the token B, it is not possible to form a subset with both entities [AB, BC], as that would require at least two instances of the token B in Q (also discussed in Step-3 above). A minimal sketch of this splitting procedure is given after these notes.
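 
To make the four steps concrete, here is a minimal, simplified Python sketch of the splitting procedure. The function name split_query_example and the use of space-separated entity strings are illustrative assumptions; this is not the required split_query interface, and its Step-1 check only verifies token availability rather than the full in-order combination rule.

from itertools import combinations
from collections import Counter

def split_query_example(query_tokens, doe):
    # Illustrative only: query_tokens is a tokenised query such as
    # ['A', 'B', 'C', 'D', 'E', 'F', 'G']; doe maps entity strings to ids.
    query_counts = Counter(query_tokens)
    # Step 1 (simplified): candidate entities are DoE entries whose tokens are all
    # available in Q.
    candidates = [e for e in doe if not Counter(e.split()) - query_counts]
    splits = []
    # Step 2: enumerate every subset of the candidate entities (including the empty one).
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            needed = Counter(t for e in subset for t in e.split())
            # Step 3: discard subsets that use a token more often than Q contains it.
            if needed - query_counts:
                continue
            # Step 4: the leftover query tokens form the token part of this split.
            tokens = list((query_counts - needed).elements())
            splits.append({'entities': list(subset), 'tokens': tokens})
    return splits

With query_tokens = list('ABCDEFG') and doe = {'A B': 0, 'D F': 1, 'G K': 2}, this sketch yields the four splits listed above.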
 
3. Query Score Computation:
Later, you are required to use the corresponding TF-IDF index to separately compute scores for the list of tokens and entities corresponding to each query split, i.e., $(k_i, e_i)$, w.r.t. the document $D_j$ as follows:

$$s_{i1} = \sum_{entity \in e_{i}} TF_{norm\_entity}(entity, D_{j}) * IDF_{entity}(entity)$$
$$s_{i2} = \sum_{token \in k_{i}} TF_{norm\_token}(token, D_{j}) * IDF_{token}(token)$$
$$score_{i}(\{k_{i}, e_{i}\}, D_{j}) = s_{i1} + \lambda * s_{i2}, \quad \lambda = 0.4$$

Finally, you are required to return the maximum score among all the query splits, i.e.,

$$score_{max} = \max\{score_{i}\}_{i=1}^{N}$$

Note, in the above-mentioned equations, we use two separate TF-IDF indexes, i.e., $TF\text{-}IDF_{token}$ and $TF\text{-}IDF_{entity}$, to compute the scores for the token split and the entity split of the query respectively.
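 
A minimal sketch of these scoring equations is given below. It is illustrative only: the callables tf_idf_entity_score and tf_idf_token_score are assumed to return the per-document TF-IDF values defined in sub-heading 1, and none of the names are part of the required interface.

def score_split(split, doc_id, tf_idf_entity_score, tf_idf_token_score, lam=0.4):
    # s_i1: sum of entity TF-IDF scores for this split in document doc_id.
    s_i1 = sum(tf_idf_entity_score(e, doc_id) for e in split['entities'])
    # s_i2: sum of token TF-IDF scores for this split in document doc_id.
    s_i2 = sum(tf_idf_token_score(t, doc_id) for t in split['tokens'])
    # Combined score with lambda = 0.4.
    return s_i1 + lam * s_i2

def best_split(splits, doc_id, tf_idf_entity_score, tf_idf_token_score):
    # Return (max_score, split) over all query splits.
    return max(((score_split(s, doc_id, tf_idf_entity_score, tf_idf_token_score), s)
                for s in splits), key=lambda pair: pair[0])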
 
Some key instructions regarding TF-IDF indexing, parsing and text processing are as follows:
 
Key Instructions:
Note that for a given set of documents, you only need to index the documents once and later use your index to compute the query scores.
You are only allowed to use Spacy (v2.1.8) for text processing and parsing. You can install Spacy via the following web-link: Spacy
 
We assume the parsing result of Spacy is always correct; we will not cater to any inconsistency in Spacy's parsing results.
 
All the tokens in the documents (D), query (Q) and dictionary of entities (DoE) are case-sensitive. You SHOULD NOT ALTER the case of the tokens.
 
You are required to compute two separate indexes, i.e., (i) For tokens, and (ii) For Entities, such that:
 
In order to compute the index of the Entities (i.e., $TF\text{-}IDF_{entity}$), you should index all the entities detected by Spacy irrespective of their entity types and/or presence in DoE. For details on Spacy's parsing and entity recognition, please see the web-link: Spacy Parsing
For single-word Entities, e.g., Trump etc., you should only compute the index corresponding to the entities. For such entities, you should not consider the corresponding token when computing the TF-IDF index of tokens.
For multi-word entities, e.g., New York Times etc., the individual tokens corresponding to the entities should be considered as free tokens and should be indexed during TF-IDF index construction of tokens (i.e., $TF\text{-}IDF_{token}$).
Stopwords: You should only use the token's attribute is_stop on a string parsed by Spacy to declare any token as a stopword and eventually remove it.
 
Punctuation: You should only use the token's attribute is_punct on a string parsed by Spacy to declare any token as a punctuation mark and eventually remove it.
 
Special Cases: You should not explicitly strip out punctuations or amend the Spacy's tokenization and parsing results. Some examples in this regard are as follows:
 
In the sentence: I am going to U.S., the correctly extracted entity is U.S.
Likewise, in the sentence: I am going to school., Spacy will extract the token school and will consider the full stop . as a punctuation mark.
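 
A minimal sketch of these text-processing rules with Spacy is given below. The model name en_core_web_sm is an assumption (any standard English model could be substituted), and the sketch deliberately does not implement the single-word-entity exclusion described in the Key Instructions above.

import spacy

nlp = spacy.load('en_core_web_sm')  # assumed model name
doc = nlp('I am going to New York City to see Trump.')

# Free tokens: remove stopwords and punctuation using only is_stop and is_punct,
# keeping the original case of each token.
tokens = [t.text for t in doc if not t.is_stop and not t.is_punct]

# Entities: everything Spacy detects, irrespective of entity type or presence in DoE.
entities = [ent.text for ent in doc.ents]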
Toy Example for Illustration
Here, we provide a small toy example for illustration: 
Let the dictionary of documents (D) be:
 
documents = {1:'President Trump was on his way to new New York in New York City.', 2:'New York Times mentioned an interesting story about Trump.', 3:'I think it would be great if I can travel to New York this summer to see Trump.'}
The term frequencies corresponding to the tokens (i.e., $TF_{token}$) are shown below as a dictionary of dictionaries of the form:
{token: {doc_id: count}}.
 
{'President': {1: 1}, 'way': {1: 1}, 'new': {1: 1}, 'New': {1: 2, 2: 1, 3: 1}, 'York': {1: 2, 2: 1, 3: 1}, 'City': {1: 1}, 'Times': {2: 1}, 'mentioned': {2: 1}, 'interesting': {2: 1}, 'story': {2: 1}, 'think': {3: 1}, 'great': {3: 1}, 'travel': {3: 1}, 'summer': {3: 1}}
Likewise, the term frequencies corresponding to the entities (i.e., $TF_{entity}$) are shown below as a dictionary of dictionaries of the form:
{entity: {doc_id: count}}.
 
{'Trump': {1: 1, 2: 1, 3: 1}, 'New York': {1: 1, 3: 1}, 'New York City': {1: 1}, 'New York Times': {2: 1}, 'this summer': {3: 1}}
Let the query (Q) be:
 
Q = 'New York Times Trump travel'
Let the DoE be:
 
DoE = {'New York Times':0, 'New York':1,'New York City':2}
The possible query splits are:
 
e1 = []; k1 = [New, York, Times, Trump, travel]
 
e2 = [New York Times]; k2 = [Trump, travel]
 
e3 = [New York]; k3 = [Times, Trump, travel]
 
Note: We cannot select a query split whose entity part is the combination of the entities ei = [New York, New York Times], because there are only single instances of the tokens New and York in Q.
 
For doc_id=3, after applying the formulas given in the sub-headings above, we get the following scores for all the query splits:
 
query = {'tokens': ['New', 'York', 'Times', 'Trump', 'travel'], 'entities': []}
{'tokens_score': 2.8301009632046026, 'entities_score': 0.0, 'combined_score': 1.132040385281841}
 
query = {'tokens': ['Trump', 'travel'], 'entities': ['New York Times']}
{'tokens_score': 1.4054651081081644, 'entities_score': 0.0, 'combined_score': 0.5621860432432658}
 
query = {'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']}
{'tokens_score': 1.4054651081081644, 'entities_score': 1.0, 'combined_score': 1.562186043243266}
And the maximum score max_score among all the query splits is: 
 
1.562186043243266 
 
And, the corresponding query split is:
 
{'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']}
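 
As a sanity check using only the formulas from the sub-headings above and the toy term frequencies: for the winning split and doc_id=3, the entity New York appears once in document 3 and in 2 of the 3 documents, so $TF_{norm\_entity} = 1.0 + \ln(1) = 1.0$ and $IDF_{entity} = 1.0 + \ln(3/(1+2)) = 1.0$, giving $s_{i1} = 1.0$. Among the tokens, Times and Trump have zero token frequency in document 3 (Trump is indexed only as a single-word entity) and contribute 0.0, while travel appears once in document 3 and in 1 of the 3 documents, so $TF_{norm\_token} = 1.0$ and $IDF_{token} = 1.0 + \ln(3/2) \approx 1.4055$, giving $s_{i2} \approx 1.4055$. The combined score is $1.0 + 0.4 \times 1.4055 \approx 1.5622$, matching the maximum reported above.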
 
Output Format (Q1):
Your output should be a tuple of the form:
(max_score, {'tokens':[...], 'entities':[...]}), where 
 
max_score corresponds to the maximum TF-IDF score among all the query splits based on Q and DoE.
The query split corresponding to the max_score, i.e., a Python dictionary containing the tokens and entities lists of that query split: {'tokens':[...], 'entities':[...]}.
Running Time (Q1):
On CSE machines, your implementation for parsing and indexing approx. 500 documents with an average length of 500 tokens SHOULD NOT take more than 120 seconds.
Once all the documents are indexed, the query splitting and score computation SHOULD NOT take more than 15 seconds.
How we test implementation of Q1
In [1]:
import pickle
import project_part1 as project_part1
In [2]:
fname = './Data/sample_documents.pickle'
documents = pickle.load(open(fname,"rb"))
 
documents
Out[2]:
{1: 'President Trump was on his way to new New York in New York City.',
 2: 'New York Times mentioned an interesting story about Trump.',
 3: 'I think it would be great if I can travel to New York this summer to see Trump.'}
In [3]:
## Step- 1. Construct the index...
index = project_part1.InvertedIndex()
 
index.index_documents(documents)
In [4]:
## Test cases
Q = 'New York Times Trump travel'
DoE = {'New York Times':0, 'New York':1,'New York City':2}
doc_id = 3
 
## 2. Split the query...
query_splits = index.split_query(Q, DoE)
 
## 3. Compute the max-score...
result = index.max_score_query(query_splits, doc_id)
result
Out[4]:
(1.562186043243266,
 {'tokens': ['Times', 'Trump', 'travel'], 'entities': ['New York']})
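 
Based only on the calls exercised in the cells above, a skeleton of the interface expected in project_part1.py might look as follows. The internal attribute names are assumptions; only the class name and the three method signatures are taken from the test code.

## Skeleton of project_part1.py, inferred from the test cells above.
class InvertedIndex:
    def __init__(self):
        # Assumed internal layout: separate postings for tokens and entities,
        # each of the form {term: {doc_id: count}}.
        self.tf_tokens = {}
        self.tf_entities = {}

    def index_documents(self, documents):
        # Parse each document with Spacy and populate both TF-IDF indexes.
        raise NotImplementedError

    def split_query(self, Q, DoE):
        # Return all valid splits of Q into entity and token lists.
        raise NotImplementedError

    def max_score_query(self, query_splits, doc_id):
        # Return (max_score, {'tokens': [...], 'entities': [...]}) for doc_id.
        raise NotImplementedError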
Project Submission and Feedback
For project submission, you are required to submit the following files:
 
Your implementation in a python file project_part1.py.
 
A report project_part1.pdf. You need to write a concise and simple report illustrating:
 
Implementation details of Q1.
Note: Every student will be entitled to 15 Feedback Attempts (use them wisely); we will use the last submission for the final evaluation of part-1.
 