A Natural Language Processing Project

A Better Indexing Method for Search Engine Using Reference Resolution

1      Introduction

The traditional indexing is word-based. The document is treated as a collection of words, and indexed based on word frequencies. This project will not only index but also taking the relationship between words into account. This is done by reference resolution. Since the implementation of indexing is not part of this course, this project will focus on reference resolution.

 

This program provides GUI by using MFC (Microsoft Foundation Class).

2      Inputs and Outputs

2.1    Inputs

As input, the program takes a regular ASCII text file.

Example:

John saw a good Acura at the dealership.

He showed it to Bob. He bought it.

 

Click on “Open” button, you will be prompt for a file, choose a text file. This file will be loaded into first Edit window. You can edit your text in this window.

Click on “All” button to run the program.

“FDG”, “Pre”, “DR”, and “Cof” buttons are used to execute intermediate step. See section 3.3. <!–[if supportFields]> REF _Ref26809089 r h <![endif]–><!–[if supportFields]><![endif]–>

This program needs Internet access to use Connexor parser. For more detail, see 3.3.

2.2    Outputs

The program outputs the final results in the fifth window.

 

The program outputs one line for each co-reference set found in that file. A co-referent set is a set of antecedent and pronouns that co-refer. The general form is co-referent set:

 

{Antecedent} | {Pronoun} | {Pronoun} | … | {Count}

 

The form of specification for both pronouns and antecedents in the same, with the elements of a specification separated by ‘+’ characters as delimiters:

{SentenceNumber}+{WordNumber}+{Text}

 Example output for the above input:

1+1+John | 2+1+He | 3+1+He | 3

 

This tells us that both “He” from the 2nd and the 3rd sentence refer to John in the first sentence. Size of this co-reference set for word “John” is 3.

3      Approach and Architecture

The approach used in the design presented here is based on the view that those nominal expressions that refer to the same real or imaginary entity can be considered as belonging to a co-referent set: a set of expressions that all refer to the same things. The task of reference resolution is to determine which co-referent set a given anaphor belongs to.

 

3.1    Basic concepts

3.1.1    Nominal Expressions

The term nominal expression refers to anaphor or antecedents. This term is commonly known as noun phrase. This term is used since it includes genitive forms such as Dell’s.

3.1.2    Referential Pronouns

Pronouns are a syntactically defined set, consisting of the words I, you, he, she, it, we, they plus their accusative and genitive forms, as in them and their respectively; the program is also designed to handle so-called lexical anaphora, by which we mean reflexive forms such as itself and themselves and reciprocal forms like each other.

 

Not all pronouns refer to things. We will use the term referential pronoun for those that do, and the term pleonastic pronoun for those that do not, as in the sentence “It is raining.”

 

This project supports only referential pronouns.

3.1.3    Discourse Referents

For each nominal expression that is assumed to be referential, we will construct a discourse referent. This is a data structure that is used to maintain various pieces of information about that nominal expression.

3.1.4    Co-referent Set

A co-referent set is a set of discourse referents that are believed to co-refer.

3.2    The Approach

First, we need to identify all the referential nominal expressions in the input text and construct a discourse referent for each. Second, for each discourse referent which corresponds to a pronoun, we need to identify which co-referent set it belongs to.

 

For the second part, this program primarily uses Kennedy and Boguraev (1996) algorithm, which bases on Lappin and Leass (1994), to resolve a pronoun to co-referent set.

3.3    The Architect

This program is divided into five parts: Parse, Preprocess text, Build discourse referents, Determine coreference relationships, output results.

3.3.1    Parse:

Any reference resolution algorithm required parsed text. Ke
nnedy and Boguraev (1996) use LingSoft parser. I choose Connexor parser since it provides Functional Dependency Grammar (FDG). This part is also known as FDG in this program. I use web interface to access Connexor parse; therefore, Internet access is required.

Execute this only by clicking on “FDG” button.

3.3.2    Preprocess text:

Build the internal tree representation from parsed text.

Execute this only by clicking on “Pre” button.

3.3.3    Build discourse referents

From that tree structure, nominal expressions will be identified.

Execute this only by clicking on “DR” button.

3.3.4    Determine coreference relationships

From those nominal expressions, coreference sets will be built.

Execute this only by clicking on “Cof” button.

3.3.5    Output results

Not only the final output but also the output from each part is displayed in a window for better debugging.

4      Bibliography

Christopher Kennedy and Branimir Boguraev [1996a] Anaphora in a wider context:

Tracking discourse referents. In Proceedings of the 12th European Conference on

Artificial Intelligence, pp582–586. August 11–16, Hungary.

 

Christopher Kennedy and Branimir Boguraev [1996b] Anaphora for everyone:

Pronominal anaphora resolution without a parser. In Proceedings of the 16th

International Conference on Computational Linguistics, Copenhagen, Denmark.

 

Shalom Lappin and Herbert J Leass [1994] An Algorithm for Pronominal Anaphora

Resolution. Computational Linguistics, Volume 20, Part 4, pages 535-561.

A Natural Language Processing Project