BMI 505-NLP Module

created 4 years ago by gina_g23
updated 4 years ago by gina_g23
1

Define formal language

precise, unambiguous, easy for computers to comprehend

Example: public static void

2

Define natural language

vague, ambiguous, easy for humans to comprehend

Example: Final exams suck.

3

What are the levels of linguistic information

1) Text level

2) Morphological Level (Lexical), example: can, can't, cannot

3) Syntactic Level, example: organize groups of words in sentences/clauses

4) Semantic Level, example: meaning conveyed by such

5) Pragmatic Level

4

What are the contents of a document, and describe them

1) Information about a world (info retrieval, info extract, summarization, translation). Example: LotR, hobbits reside in the Shire....I extracted this information when reading the Fellowship of the Ring

2) Opinion about the information (sentiment analysis, recommendation engines, social media monitoring). Example: Everyone loves LotR, this information is gathered through basic twitter analyses

3) Information about the writer (author verification, plagiarism detection, author profiling). Example: Did J.R.R. Tolkien plagiarize LotR? Negative!

5

List the layers of language extraction

1) Lexical-words/terms

2) Syntactic- organization of words/terms into sentences/clauses

3) Semantic- meaning conveyed by such

4) Discourse- going across sentences

6

Main Problem of NLP?

Ambiguity. Example: Several meanings for the sentence, "I made her duck."

7

How to handle ambiguity?

1) Pipeline Processing-ignore ambiguity as it occurs and hope other levels eliminate incorrect structures

2) Probabilistic Approach-based on making the most likely choice

3) Don't do anything, maybe it won't matter

8

Biomedical text mining is important because

There's a lot of data hiding in Biomedical texts

9

Why don't we evaluate a system with the same data used to train it

We want to avoid overfitting the data

• Solution #1: Training and test sets
– Use one set specifically for training, one for evaluation
• Solution #2: Cross-validation
– 10-fold cross-validation: Split data into 10 parts; train on 9, test on 1; repeat for each part; average
– 5x2 cross-validation: Split data into 2 parts; train on one, test on the other; repeat 5 times; average

10

What is stemming

Identifying lexical variants and unifying them by morphological analysis, assign canonical base form (example activat-es, activat-ed)

11

What's a lexicon

Upon relating canonical forms to a lexical entry, identify the parts of speech of the given tokens in a sentence

12

What is the Porter Algorithm

Lexicon-free approach to stemming. Matches suffixes right to left based on the longest match. Example: for activation, both -ion and -ation match; the longer -ation wins, giving activ-ation (stem activ).

13

Information Retrieval

Finding relevant documents for a specified information need

14

Information extraction

Method needed to find biomedical entities and associations among them

15

What is named entity recognition

Extract entities of interest (genes or drugs) from literature

16

Normalization

map entities to standard id, association between them (protein-protein, gene-drug). Links objects of potential interest (genes) to detailed information not found in publications. Gene mention normalization is hard because genes have tons of tricky names

17

Problem found in named entity recognition

problem of finding reference to entities (mentions) present in text and tagging them with their location and type

18

Hidden Markov Models

Sequence length of 1 would be equivalent to Naïve Bayes classifier
– Assumes that features are independent of each other
– Gives equal weight to each feature

19

What is regular expression

formal language for specifying text strings (i.e. protein, proteins, Protein, Proteins), formula in a special language that's used for specifying a simple class of string. Algebraic notation characterizing set of strings

20

What do regular expressions require?

Require pattern (what we want to search for) and corpus (texts to search through)

21

RE to find protein with lower or upper case

[Pp]rotein

22

RE to find any number

[0-9]

23

RE for any upper case or lower case letter

[A-Za-z]

24

RE for non-upper case

[^A-Z]

25

RE for no S or s

[^Ss]

26

RE for no e or ^

[^e^]

27

RE for finding a^b

a^b

28

RE to find Ibuprofen or Advil

Ibuprofen|Advil

29

RE for country or countries

countr(y|ies)
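The character-class and disjunction patterns in the cards above can be exercised with Python's re module; a minimal sketch (the test strings are mine):

```python
import re

# [Pp]rotein matches "protein" or "Protein"
assert re.search(r"[Pp]rotein", "Protein folding") is not None
assert re.search(r"[Pp]rotein", "PROTEIN") is None  # "ROTEIN" is not "rotein"

# [0-9] matches any single digit
assert re.findall(r"[0-9]", "p53 and BRCA1") == ["5", "3", "1"]

# [^Ss] matches any character that is not S or s
assert re.search(r"[^Ss]", "SSss") is None

# Disjunction: Ibuprofen|Advil
assert re.findall(r"Ibuprofen|Advil", "Take Advil or Ibuprofen") == ["Advil", "Ibuprofen"]

# countr(y|ies) matches both forms
assert re.fullmatch(r"countr(y|ies)", "countries") is not None
```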

30

What does * do for RE

Example: oo*h

the previous character occurs zero or more times (may be absent, or repeated indefinitely)

Example: oo*h! = oh!, ooh!, oooh!

31

What does + do for RE

Example: o+h!

the previous character occurs one or more times (must appear at least once, and may be repeated indefinitely)

Example: o+h! = oh!, ooh!, oooh!

32

What does . do for RE

Example: beg.n

Any single character

Example: begin, began, begun, beg3n

33

What will the RE .* bring?

Any string of characters.

34

What does ? do for RE

Example: colou?r

The previous character is optional (zero or one occurrence); it cannot be duplicated.

Example: color, colour

35

What does RE ^[A-Z] mean?

Selects capital letters at the start of a line (outside a character class, ^ is an anchor, not negation)

Example: Palo Alto. American flag. However, etc....

36

What does RE ^[^A-Za-z] mean?

Anything at the start of a sentence but isn't a letter.
Example: "Hello." or 1.

37

What does the RE \.$ gather?

$ anchors the match to the end of the line; \ escapes the period so it matches a literal "." rather than any character.

Example: The end. The RE gathers periods at the end of sentences.

38

What does the RE .$ gather?

Gathers any ending punctuation at the end of a sentence

Example: The end! The end?

^ The ! and ? are gathered from the above sentences

39

What does the RE [a-z]b$ gather?

Gathers: any word ending with b i.e. only ab, bb, cb, ob etc...

40

What does [a-z]+b$ gather?

Gathers any string ending with b preceded by one or more lowercase letters, so matches can be longer than two characters. Example: cab, cooob, lab, arab, etc...
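The quantifier and anchor cards above can likewise be checked against Python's re module (test strings are mine):

```python
import re

assert re.search(r"^[A-Z]", "Palo Alto") is not None   # capital at string start
assert re.search(r"^[A-Z]", "hello") is None

assert re.fullmatch(r"colou?r", "color") is not None   # ? = optional previous char
assert re.fullmatch(r"colou?r", "colour") is not None

assert re.fullmatch(r"oo*h!", "oh!") is not None       # * = zero or more
assert re.fullmatch(r"o+h!", "oooh!") is not None      # + = one or more
assert re.fullmatch(r"beg.n", "beg3n") is not None     # . = any single char

# \.$ matches a literal period at the end; .$ matches any final character
assert re.search(r"\.$", "The end.") is not None
assert re.search(r"\.$", "The end!") is None
assert re.search(r"[a-z]+b$", "cab") is not None       # lowercase letters then final b
```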

41

What is a type 1 error

False positive, matching strings that shouldn't have matched

42

What is a type 2 error

False negative, not matching things that should have matched

43

How to decrease false positive?

How to decrease false negative?

Decrease false positive by increasing accuracy/precision

Decrease false negative by increasing coverage/recall

44

What are the 3 formalisms for capturing language?

Regular expression, finite state automata, regular grammar

45

What is a finite state automata

Mathematical model that processes input strings and may either accept or reject them; stores a finite amount of information; has a set of states which change in response to input, with rules on how states change (transitions)

46

Draw out the automaton for baa+!

Will have q0-q4

q0: b, q1: a, q2: a q3: repeat a, shift to q4 ending in !

In all, has 5 states (q0-q4), b, a, ! in its alphabet, q0 is the start state, q4 is the accept state, and has 5 transitions (arrows)
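The machine above can be simulated with a transition table; this is a minimal sketch, with state names q0-q4 following the card:

```python
# Transition table for the baa+! automaton: accept state is q4.
TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",   # self-loop: one or more extra a's
    ("q3", "!"): "q4",
}

def accepts(s: str) -> bool:
    state = "q0"
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False          # no transition defined: reject
    return state == "q4"          # accept only if we end in the final state

assert accepts("baa!") and accepts("baaaa!")
assert not accepts("ba!") and not accepts("baa")
```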

47

Things required in a FSA

Set of states, finite alphabet, start state, final state, transition function

48

Define recognition

Determine whether the machine should accept a string; equivalently, determine whether the string is in the language defined by the machine. All amount to the same thing in the end

49

What is Turing's notion

Involves a tape: begin at the start state, examine the input and consult the tape, go to a new state and update the tape pointer, until out of tape.

50

RE gathering all alphabetical string

[a-zA-Z]+

51

RE to gather all lower case alphabet string that end in b

[a-z]*b

52

RE to gather string with 2 consecutive repeat words such as the and the

([a-zA-Z]+)\s+\1
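In Python's re, \1 refers back to the first capture group; the \b word boundaries below are a small refinement of the card's pattern so the repeated word is not matched inside a longer word:

```python
import re

# Find two consecutive repeated words such as "the the".
pattern = r"\b([a-zA-Z]+)\s+\1\b"

m = re.search(pattern, "I saw the the cat")
assert m is not None and m.group(1) == "the"
assert re.search(pattern, "the cat sat") is None
```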

53

RE from alphabet a, b with a preceded and followed by a b

(b+(ab+)*)?  (the inner * rather than + also allows strings of b's containing no a at all)

54

RE with grotto or raven in them, not grottos

\bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b

55

What is a lexicon

Vocab/language of a person

56

Morphology define

study of ways that words are built from smaller meaningful units called morphemes

57

What are the two classes of morphemes

Stems-core meaning-bearing unit

Affixes-bits and pieces that adhere to stems, change their meanings and grammatical function

58

Morphological parsing define

recognizing morphemes inside of a word

59

What are the 4 morphological deviations

1) Inflection-stem with grammatical morpheme, same word class (i.e. clean and cleaning)

2) Derivation-stem with grammatical morpheme, yields a different word class (clean (v) and cleaning (n))

3) Compounding- combine multiple word stems

4) Cliticization-combination of word stem with a clitic, different words from different syntactic categories, i.e. I've and I+have

60

Examples of regular vs irregular

Regular-walk, walks, walking, walked

Irregular-eat, eats, ate, eaten

61

What is parsing

Take a string in the language and assign a structure to it

62

Production/generation

have a structure and want to produce a surface form for it

63

Finite state transducer

Finite automaton maps between 2 sets of symbols. Transitions labeled with 2 symbols (input: output). Transducers translate the string

64

Morphological Analysis

May be stand alone or link to further linguistic analysis

65

How to handle ambiguity

1) take first output found

2) find all possible outputs and return all

3) Bias search so only few are explored

66

Tokenization

break text into individual words. Use whitespace and punctuation. May/may not include numbers, dates, may expand clitics

67

Sentence segmentation

converts a text segment into individual sentences. The challenge is disambiguating the period ("Inc.", "Dr."); need a list of common abbreviations.

68

Stemming

Reduce terms to stems, ex: automat = automatic, automation. Makes some errors, ex: organization → organ. Good for information extraction and computationally efficient

69

Steps of text normalization

1) Tokenize-get words (pull them from text) identify words we're interested in

2) Stemming/Lemmatizing-normalizing words (breaking down words)

3) Segmentation-getting sentences

70

Lemma

same stem, part of speech, rough word sense; cat and cats have the same lemma

71

Word form

full inflected surface form, cat and cats, different word forms

72

Token

instance of use of some lemma in text

73

Type

refers to a particular lemma

74

How many tokens are found in the sentence: They picnicked by the pool and then lay back on the grass.

12 tokens and 11 types ("the" occurs twice)
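A quick check by splitting on whitespace (stripping the final period); note "the" is the only repeated word, so this count gives 11 distinct types:

```python
# Count tokens and types for the card's sentence.
sentence = "They picnicked by the pool and then lay back on the grass."
tokens = sentence.rstrip(".").split()
types = set(tokens)

assert len(tokens) == 12
assert len(types) == 11   # "the" occurs twice
```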

75

Why is white space not used in tokenization

White space alone is not used in tokenizing because it would bring back tokens like "said," and "cents." with punctuation attached.

Challenges: word-internal punctuation (AT&T, Google.com), clitics (what're, I'm), and multi-token words (New York). Also problematic for languages such as French, German, or Chinese

76

Examples of internal punctuation, clitics, and multi-token

Internal Punctuation: AT&T, Ph.D.

Clitics: I'm, we've

Multi-token: Rock'n'roll

77

Normalization

normalizes terms, deletes periods in terms. Will match USA and U.S.A.

78

Stem Info Retriever

Run a stemmer on the documents and on users' queries, then match

79

Porter

no lexicon needed; strips suffixes through staged sets of rewrite rules; doesn't guarantee that the resulting stem is really a stem

80

Lemmatization

reduce inflection/variant to base, have to find correct dictionary headword form

Example: are, is, am ==>be

car, cars, cars' ==> car

81

What is the focus of NLP

Finding useful features and extracting those features

82

Find minimum edit distance for Intention and Execution

Distance is 8

83

What are the initial and goal state?

Initial-word we're transforming

Goal-final word

84

What is the path cost

Number of edits, want to minimize. Minimum path to each revisited state. For minimum edit distance remember back pointer for each value. Left arrow is insertion. Down is deletion. Diagonal is substitution (+2)

85

What is word prediction

Looks at rest of sentence and attempts to guess the next word

86

What are n-grams

Token sequence of n. Predicts next word from previous. Compute probability of next word similar to computing the probability of sequence of words. Used in automatic speech recognition, and spelling correction

87

What is language modeling

Goal of computing probability of a word, w given some history, h or P(w|h)

88

What is the Markov assumption

Probability of a word depends on probability of limited history

89

Define n-gram

...

90

What is a language model

...

91

How do you train a language model

...

92

How to deal with unknown words?

Create an unknown-word token <UNK>. In training, change words not in the lexicon L to <UNK>; at test time, use <UNK> for words not in the training set

93

How to evaluate n-gram models

Extrinsic evaluation: put model A into an application (e.g., a speech recognizer) and evaluate performance; then put model B in and evaluate. Compare the application's performance with both models

94

Intrinsic evaluation

Train parameters on a training set; we want to know how the model does on data it has never seen.

95

What is a test set

different from training set, drawn from same source.

96

What is an evaluation metric

how well model is doing on test set

97

Compare pros and cons of intrinsic vs extrinsic

...

98

What is perplexity

Measure of the notion of surprise

Increase surprise, decrease probability assigned to test set

The higher the probability, the less surprised it was

Decrease perplexity, increase probability
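The inverse relationship can be made concrete by computing perplexity as the inverse probability of the test set, normalized by its length (a standard formulation; the probability lists below are toy values):

```python
import math

def perplexity(word_probs):
    """word_probs: per-word probabilities the model assigned to the test set.
    PP(W) = P(w1..wN) ** (-1/N), computed in log space for stability."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# The higher the probability assigned, the lower the perplexity (less surprise).
assert abs(perplexity([0.5, 0.5, 0.5]) - 2.0) < 1e-9
assert perplexity([0.1, 0.1, 0.1]) > perplexity([0.5, 0.5, 0.5])
```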

99

The lower the perplexity, the ___ the model?

- The lower the perplexity, the better the model (i.e. trigrams)

100

What is an entity

Anything with a name (gene, protein, etc...)

101

Name entity recognition

locate things in natural language text and spec. type

102

Normalization

identify which specific entities text references

103

Extraction

identify entities and relationships between them

104

Ambiguity, give an example

different interpretations of a particular term. In NLP, ambiguity can be at the lexical, syntactic, or semantic level. Example: I made her duck

105

Lemma define

Canonical (dictionary) form of a word

106

Inflection

word stem with a grammatical morpheme, same word class i.e. clean (verb), clean-ing (verb)

107

named entity

anything with a name

108

compounding

combination of multiple word stems

109

cliticization

combination of a word stem with a clitic, different words from different syntactic categories i.e. I've and I + have

110

Tokenization

break text into individual words using whitespace and punctuation

111

Sentence Segmentation

Converts a segment of text into individual sentences

112

5 levels of linguistic analysis are

Lexical, morphological, syntactic, semantic, pragmatic

113

Goal of NLP

Translate document into some formalism that can be used with a computer. "Represent the meaning of a sentence by logical formations"-Abeed

Human brain => meaning => spoken language => computer => formal language => meaning

114

Morphemes can be divided into

Stems and affixes

115

A non-deterministic finite state automaton is equivalent to...?

Some deterministic FSA; however, not all finite state transducers can be determinized

116

How to calculate f-measure

2pr/(p+r) is the harmonic mean of precision and recall

117

What is contained in a document

Information about the world

Opinion about the information

Information about the author

118

Differences between a finite state automata and a finite state transducer

A finite state transducer is a type of finite automaton which maps between two sets of symbols. We can visualize an FST as a two-tape automaton that recognizes or generates pairs of strings. An FSA is less general and only considers one tape: officially, it defines a formal language by defining a set of strings

119

How does the porter stemmer work?

The Porter stemmer matches the longest suffix and then stems based on that.

  1. Builds on a list of inflectional and derivational suffixes (morphologically recurring strings such as es, ed, ing, ion, ly...) and iteratively matches incoming strings from right to left, based on the longest match
  2. For activation, both ion and ation are matches; since ation is the longer match, it will choose activ rather than activat as the base form
120

Why not use white-space for tokenizing?

  1. Multi-token words like New York are one word but would be split if whitespace were used for tokenizing. Also, some languages do not use whitespace to separate the words in a sentence.
  2. Clitics would not be separated: I'm, won't
  • Also word-internal punctuation (Ph.D.)
121

Explain the meaning of precision and recall

  1. Recall is the sensitivity of the test: the percent of things found out of the total it should have found. Precision is the positive predictive value: the percent of things found that are correct.
122

Compute by hand the Levenshtein distance between union and fusion and give the best alignment

  • Levenshtein distance: (substitution between s and n) 2+1 (insertion of f)= 3
  • Substitutions cost 2
  • Deletions cost 1
  • Insertions cost 1

Best alignment:

* U N I O N

| | | | | |

F U S I O N

1 0 2 0 0 0

Levenshtein distance= 1+2=3
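The alignment above can be checked with a small dynamic-programming implementation using the course's costs (substitution 2, insertion/deletion 1); a minimal sketch:

```python
def min_edit_distance(src, tgt, sub_cost=2):
    """Minimum edit distance with insertion/deletion cost 1 and
    substitution cost sub_cost (2, per the course convention)."""
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                            # delete all of src
    for j in range(1, m + 1):
        d[0][j] = j                            # insert all of tgt
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i-1] == tgt[j-1] else sub_cost
            d[i][j] = min(d[i-1][j] + 1,       # deletion
                          d[i][j-1] + 1,       # insertion
                          d[i-1][j-1] + sub)   # substitution / match
    return d[n][m]

assert min_edit_distance("union", "fusion") == 3        # insert f (1) + sub n->s (2)
assert min_edit_distance("intention", "execution") == 8
```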

123
  1. Given the mini corpus A, compute the probability of the sequence <S> I like eggs </S> using bigrams I would practice Laplace (+1 smoothing too..TS)
    1. Corpus A:
      1. <S> I am Sam</S>
      2. <S> Sam I am</S>
      3. <S> I like green eggs</S>
      4. <S> I like ham too</S>
      5. <S> Sam really like eggs</S>

P(I|<S>)= 3/5

P(like|I)= 2/4

P(eggs|like)=1/3

P(</S>|eggs)=2/2

So probability comes out to: (⅗)*(2/4)*(⅓)*(2/2)= 3/30=1/10
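The hand computation above can be reproduced with maximum-likelihood bigram estimates (corpus lines copied from the card; the helper names are mine):

```python
from collections import Counter

corpus = [
    "<S> I am Sam </S>",
    "<S> Sam I am </S>",
    "<S> I like green eggs </S>",
    "<S> I like ham too </S>",
    "<S> Sam really like eggs </S>",
]
unigrams, bigrams = Counter(), Counter()
for line in corpus:
    toks = line.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p(w, prev):                        # P(w | prev) = C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

seq = ["<S>", "I", "like", "eggs", "</S>"]
prob = 1.0
for prev, w in zip(seq, seq[1:]):
    prob *= p(w, prev)
assert abs(prob - 0.1) < 1e-12         # (3/5)*(2/4)*(1/3)*(2/2) = 1/10
```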

124

What are the three main approaches to Named-Entity Recognition? Pick one of the three and briefly describe the approach and give the pros and cons.

  1. The dictionary-based approach
  2. The rule-based approach
  3. Supervised machine learning
  • The rule based approach
    • Manually define a set of pattern rules
    • Requires domain experts =expensive
  • Pros: provides explanation with conclusions
  • Cons: performance degrades seriously on text from different sources or purposes
125

Stemmer

  • Stemmer - reduces terms to their stems,
  • Stemming : rule based form of morphological parsing, reduces terms to their stems by cutting off the affixes
126

Porter’s Algorithm Stemmer

- most common english stemmer

127

Smallest unit that makes up a word?

Morpheme

128

What is the Markov assumption

The assumption that the probability of a word depends only on the previous word is called a Markov assumption

129

Shannon's Method

A way to generate text from the learned N-gram model in order to estimate the quality of the model. (Chapter 4, N-Gram part 2, slide 4, presented in class; discussed in the book p. 93 and in the online lecture.)

130

Smoothing

it's the general term for techniques used to estimate a Ngram probability when the NGram is not observed in the training data

131

NER

(How you would create such a system; see the review session with the NE extraction/recognition system.)

Selection of the corpus

Annotation of the corpus (gold standard)

Extraction of the NEs:

by Dictionary Matching

by Writing rules

by Machine Learning

Evaluation of the performance using the gold standard

132

How does the porter stemmer work?

  • How it does it: it sequentially applies a finite set of rules which iteratively chop off the words' affixes. It doesn't use a lexicon. (See slide 35 for the rules: the 1a rules are applied first, their outputs are then used with 1b, and so on.)
133

Named Entity Definition

  • Anything with a name
    • That is a good definition. You can also insert two or three examples (always show that you know more during an exam; the goal is to find out whether you know the concepts, their properties, and their relations). You can also mention the history: the NLP community started with names of people/places/organizations and numbers (money, dates, percentages, for example), and the concept was then extended to anything with a name (such as a collection of objects like the gene SpoIIID or the Toyota Yaris car, diseases, etc.)
134

Regular Expression match vs find

[Dd]og

Match: the RE must match the input string exactly: [Dd]og won't match the input string "dogs"

Find (recognize): the RE must match part of the input string: [Dd]og will find dog in the input string "dogs"

  • Avoid using "word", since you can apply an RE to a document, a DB field, a sentence, etc. To be general, use "string": a sequence of characters
135

LaPlace smoothing for Bi-Gram Probabilities (from review)):

  • Use when an N-gram that is possible has a frequency of 0 in the training data (in practice, applied by default).
  • Add one to every frequency.
  • The denominator of the new probability is the original denominator plus the size of the vocabulary. In the review-session example it is 8 because there are 8 words: <S>, I, go, to, school, the, important, </S>. The new frequency is the frequency of the unigram observed in the training corpus + 8, the size of the vocabulary. Be sure you can recompute the review example without looking at the answer; there is another example in the book.
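A sketch of the add-one formula; the toy training sentence here is mine, not the review example:

```python
from collections import Counter

# Add-one (Laplace) bigram smoothing:
# P(w | prev) = (C(prev w) + 1) / (C(prev) + V), V = vocabulary size.
tokens = "<S> I go to school </S>".split()   # toy training data
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)                            # 6 distinct words here

def p_laplace(w, prev):
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# An unseen bigram gets a small nonzero probability instead of 0.
assert p_laplace("school", "go") > 0
assert abs(p_laplace("to", "go") - 2 / 7) < 1e-12   # (1+1)/(1+6)
```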
136

Limitation of LaPlace Smoothing

  • Its limitation: it shifts too much probability mass away from observed N-grams (see the earlier answer about the reconstituted frequencies). In general, you apply a smoothing technique by default, since you process a corpus of thousands/millions of texts: you don't know exactly which bigrams you will observe in your training and test corpora and which you will not (either because they are impossible given the grammar, or just because nobody has ever written them). The goal is to avoid a frequency of 0 when the bigram is possible. The smoothing also works for any N-gram.
137

Morphology

study of the ways that words are built up from smaller meaningful units called morphemes

138

This is the text version of terms:

stem: congenital affix: ly

stem: dislocat affix: ed

stem: hip affix: s

What are their lexicons?

Stem: congenital affix<none>

Stem: hip affix<none>

Stem: dislocat affix:ion

139

The way that stems and affixes combine is based on?

The word class of the stem, by which we have in mind familiar notions like noun and verb

140

Morphological parsing

Task of recognizing morphemes inside a word

141

What language processing tasks use morphological parsing?

Machine translation, information retrieval, part of speech tagging

142

Inflectional Morphology

combination of stems and affixes where the resulting word has the same word class as the original

143

Nouns vs Verbs

Nouns are simple, with affixes for plural and possessive

Verbs are complex, and affixes indicate the tense of the verb

144

Regular vs irregular

And examples of each

Words that follow rules and those that don't

Regular: walk, walks, walking, walked

Irregular: eats, eat, eating, ate, eaten

Catch, catches, catching, caught

145

Derivational morphology is used to transform words when there are:

Irregular changes of meaning and Changes of word class

146

We use FSAs to compute what about morphology

Accept strings that are in the language, Reject strings that are not, And do so in a way that doesn’t require us to in effect list all the words in the language

147

How to start off simple with an FSA

Regular singular nouns are ok, Regular plural nouns have an -s on
the end, Irregulars are ok as is

148

Parsing

assign a structure to a string in the language

149

production/generation

having a structure and wanting to produce a surface form for it

150

Finite State Transducer

type of finite automaton which maps between two sets of symbols.

Transducers translate (or transduce) strings.

151

Morphological analysis can either be:

• An important stand-alone component of many applications (spelling correction, information retrieval)
• Or simply a link in a chain of further linguistic analysis

152

In FSTs, the path to an accept state, unlike non-deterministic recognition...

does matter, since different paths represent different parses and different outputs will result.

Recall that in non-deterministic recognition, multiple paths through a machine may lead to an accept state.

153

How to deal with ambiguity in FSTs

Simply take the first output found
• Find all the possible outputs (all paths) and return them all (without
choosing)
• Bias the search so that only one or a few likely paths are explored

154

Tokenization

Break text into individual words
• (Mostly) straightforward in English: use whitespace and punctuation
• May (or may not) want to include numbers, dates, collocations
• May want to handle / expand clitics

155

Sentence segmentation

Converts a segment of text into individual sentences
• (Mostly) straightforward in English
• What are the punctuation marks used to end a sentence?
• The main difficulty is disambiguating the period: "Dr." "Inc." "Ph.D."
• A list of common abbreviations helps considerably
• The best systems employ machine learning

156

Stemming

A rule-based form of "light" morphological parsing
• Reduce terms to their stems, e.g., automate(s), automatic, automation all reduced to automat
• Will make some errors, e.g. "organization" → "organ"
• No context: "unionizable" → "union" + "-izable" or "un-" + "ion" + "-izable"
• However: computationally efficient, and very useful for information extraction

157

Describe Porter's algorithm

Step 1a
  sses => ss    (caresses => caress)
  ies => i      (ponies => poni)
  ss => ss      (caress => caress)
  s => ø        (cats => cat)
  ...
Step 1b
  (*v*)ing => ø (walking => walk; sing => sing, unchanged)
  (*v*)ed => ø  (plastered => plaster)
  ...
Step 2 (for long stems)
  ational => ate (relational => relate)
  izer => ize    (digitizer => digitize)
  ator => ate    (operator => operate)
  ...
Step 3 (for longer stems)
  al => ø    (revival => reviv)
  able => ø  (adjustable => adjust)
  ate => ø   (activate => activ)
  ...

158

Tokenizing

Break text into individual words
• (Mostly) straightforward in English: use whitespace and punctuation
• May (or may not) want to include numbers, dates, collocations
• May want to handle / expand clitics

159

Stemming/lemmatizing

Normalizing the words

160

Segmentation

Getting sentence

161

Every NLP task needs to do text normalization by

1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text

162

Lemma

Lemma: same stem, part of speech, rough word sense
• cat and cats = same lemma

163

Wordform

Wordform: the full inflected surface form
• cat and cats = different wordforms

164

Token/type distinction

A token is an instance of a use of some lemma in a text or utterance

A type refers to a particular lemma (Speech and Language Processing, Jurafsky)

165

How many n=tokens and v=set of types/vocab in sentence: They picnicked by the pool then lay back on the grass and
looked at the stars

16 tokens and 14 types

166

Maximum Matching Word Segmentation, how to do it and what does it work best with

Given a lexicon of Chinese, and a string:
1) Start a pointer at the beginning of the string
2) Find the longest word in the dictionary that matches the string starting at the pointer
3) Move the pointer over the word in the string
4) Go to 2

Doesn't work well with English
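The steps above can be sketched directly; the toy English lexicon here is made up for illustration (real use assumes a Chinese dictionary):

```python
def max_match(text, lexicon):
    """Greedy maximum-matching segmentation: longest dictionary word
    starting at the pointer; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

lexicon = {"the", "table", "down", "there"}
assert max_match("thetabledownthere", lexicon) == ["the", "table", "down", "there"]
# With a fuller English lexicon containing "theta", the greedy match would go
# wrong ("theta bled own there") -- which is why this works poorly for English.
```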

167

Normalization

We most commonly implicitly define equivalence classes of terms, e.g., by deleting periods in a term
• The alternative is to do asymmetric expansion:
  Enter: window    Search: window, windows
  Enter: windows   Search: Windows, windows, window
  Enter: Windows   Search: Windows

168

Stemming for Info Retrieval

Run a stemmer on the documents to be indexed
• Run a stemmer on users' queries
• Match
• This is basically a form of hashing, where you're looking for collisions

169

What is the Porter stemmer

No lexicon needed
• Basically a set of staged sets of rewrite rules that strip suffixes
• Handles both inflectional and derivational suffixes
• Doesn't guarantee that the resulting stem is really a stem (see first bullet)
• The lack of guarantee doesn't matter for IR

170

Lemmatization

Reduce inflections or variant forms to base form
• am, are, is => be
• car, cars, car's, cars' => car
• the boy's cars are different colors => the boy car be different color
• Lemmatization: have to find the correct dictionary headword form
• Machine translation: Spanish quiero ('I want') and quieres ('you want') have the same lemma as querer 'want'

171

Why not use punctuation to find sentences, i.e. break on period space

Because the '.' is ambiguous in English... and isn't there at all in Chinese...

172

To perform Supervise Machine Learning

Train an SVM/Maxent/HMM etc. to disambiguate the sentence-ending markers
• Binary decision task: this period is classified as an end-of-sentence marker or not
• Obviously, to do this you need a gold-standard training set
• And you need to be able to extract features from the text to be segmented

173

Supervised ML in NLP focuses on

finding useful features and being able to extract those features

174

Minimum edit distance

The minimum edit distance between two strings is the minimum number of editing operations needed to transform one string into the other
• Insertion
• Deletion
• Substitution

175

Edit Distance can be applied to NLP by

Evaluating Machine Translation and Speech Recognition, Named Entity Extraction

176

Alignment

a 1-to-1 pairing of each element in a sequence with a corresponding element in the other sequence or with a gap

177

During Min Edit Distance keep back pointer bc

Every time we fill a cell, add a pointer back to the cell that was used to create it (the min cell that led to it)
• To get the sequence of operations, follow the backpointers from the final cell
• That's the same as the alignment

178

Why would we want to add weights to the min edit distance computation

Spell correction: some letters are more likely to be mistyped than others
• Biology: certain kinds of deletions or insertions are more likely than others

179

N-gram model

N-grams are token sequences of length N
• 2-gram = bigram, 3-gram = trigram
• An N-gram model predicts the next word from the previous N-1 words
• Computing the probability of the next word is closely related to computing the probability of a sequence of words

180

Applications of N-grams, predicting the next word that will be said

Automatic speech recognition, spelling correction, handwriting recognition, machine translation

181

Counting is not effective because

the use of "uh" in spoken language; should "uh" count as a token?

182

Language Model

statistical model that can assess the probability of a word w given some history h

183

Estimating isn't good for language modeling because

Most text collections won't give a good estimate

184

Markov assumption

each component in the product is replaced with an approximation that depends only on a limited history

185

N-gram probabilities capture what facts about language

World knowledge, syntax, and discourse

186

Named entity recognition

Locate the entities (things) in natural language and specify their type

187

Entity and example

anything with a name (ex: genes, proteins, diseases, drugs, cell/tissue types)

188

Named entities are useful for

– Summarization
– Document classification
– Indexing
– Information retrieval
– Natural language understanding

189

Normalization

identify which specific entities the text references

190

extraction

identify entities and the relationships between them

191

Difficulties of NER

Ambiguity
– Some words are ambiguous with normal text, e.g. "White" as a gene name
– Some entity types are ambiguous with each other: human experts often can't tell genes and proteins apart (often considered a single type)
– Acronyms often have many meanings

Unseen word problem
– Not all names will be in the training data; must infer from context: "We describe the previously unrecognized gene ABC123"
– Novel name variations: "HERB2" vs "hERB-II"

Boundary problem
– Long names are harder to find: "human T-cell leukaemia lymphotropic virus type 1 Tax protein"
– Difficult for a human annotator to remain consistent
– The more descriptive the name, the more it looks like non-mention text

192

Evaluation of NER

Gold standard – a corpus where human
annotators have already determined the
answers
• Inter-annotator agreement
– The percentage of annotations in agreement out of
the total number of annotations
– Reported values for BioNER range from 75-90%
• Intra-annotator agreement not much better
– Need annotation guidelines
• Detailed, unambiguous

193

Gold Standard

corpus where human
annotators have already determined the
answers

194

Inter-annotator agreement

The percentage of annotations in agreement out of
the total number of annotations

195

Precision formula is

tp / (tp + fp)

percent of things found that are correct

196

Recall formula is

tp / (tp + fn)

percent of things that should have been found that were actually found

197

What is f-measure

F-measure = 2pr / (p + r)
– Harmonic mean of precision & recall
– Harmonic mean gives a higher penalty for being
out of balance than arithmetic mean
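The three formulas above can be computed directly (the counts here are invented for illustration):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    # harmonic mean: penalizes imbalance more than the arithmetic mean
    return 2 * p * r / (p + r)

# e.g. 8 correct mentions found, 2 spurious, 4 missed
p, r = precision(8, 2), recall(8, 4)
print(p, r, f_measure(p, r))  # 0.8, 0.666..., 0.727...
```

Note that (0.8 + 2/3) / 2 = 0.733, so the harmonic mean (0.727) is indeed slightly lower than the arithmetic mean here; the gap grows as p and r diverge.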

198

Common approach to NER

Tokenization – break the sentence into tokens (words)
• Tagging – label each token with what type of entity it represents
  • Dictionary
  • Rule-based
  • Supervised machine learning
• Post-processing – cleanup (if needed)

199

Approaches used for NER

Dictionary-based approach

Rule Based Approach

supervised machine learning

200

Dictionary based NER approach

Match text from a list of entities
– Automatically handle variations
  • Capitalization, punctuation, etc.
– Could use approximate string matching
– May confuse entities with normal text (“THE gene”) or confuse entity type
• Disadvantage: usually no sufficient dictionary exists
– Hard tradeoff between precision & recall
• Advantage: automatically provides the entity identification (normalization)

201

Rule based approach

Manually define a set of pattern rules
– E.g. “current token contains digits AND token to right is 'gene'”
– Hard trade-off between precision & recall
• Requires domain experts = expensive
• Disadvantage: performance degrades seriously on text from different sources or purposes
• Advantage: does provide an explanation for its conclusions

202

Supervised Machine Learning

Requires training data
– Still need domain experts
– At least their work can be re-used
• Conclusions based on large computations
– Not easily converted into an explanation
• Does make conclusions based on context
• Can return novel entities
• Cons: Takes time to train (possibly days)

203

Post Processing

Often manual rules used to clean up output
• Local abbreviations
– Examples:
• antilymphocyte globulin (ALG)
• ALG (antilymphocyte globulin)
– Ensure both mentions get the same label
• Parenthesis post-processing
– Drop mentions which contain a mismatched
parenthesis
– Extend a mention with a mismatched parenthesis
until matched

204

Parenthesis post processing

Drop mentions which contain a mismatched
parenthesis
– Extend a mention with a mismatched parenthesis
until matched

205

Why is ML used in NER, and what do we need to use it?

Machine learning is used in all of the top-performing systems
– A straightforward application of ML usually gives decent performance by itself
– Best systems often use ML in combination with other techniques
• To use ML, we need:
– A feature representation
– An ML algorithm

206

Feature Vector Representation

Just the token is not enough input information
for good performance
• Each token is therefore represented by a
vector of features
• These features contain everything we expect to
be relevant about that token

207

Syntactic features

Part of speech

208

Semantic features

Gazetteers, lexicons, dictionaries
– Use a trie to label input based on dictionary
content
• Handle multiple tokens
• Pass this labeling as an input feature
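A minimal sketch of dictionary labeling with longest match, using a made-up gazetteer and a plain set lookup standing in for the trie:

```python
# Tiny invented gazetteer of multi-token entity names
gazetteer = {("BRCA1",), ("IL", "6"), ("tumor", "necrosis", "factor")}
max_len = max(len(entry) for entry in gazetteer)

def label(tokens):
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # try the longest span first so "tumor necrosis factor" beats "tumor"
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in gazetteer:
                labels[i:i + n] = ["B"] + ["I"] * (n - 1)
                i += n
                break
        else:
            i += 1
    return labels

print(label("mutations in BRCA1 and tumor necrosis factor".split()))
# ['O', 'O', 'B', 'O', 'B', 'I', 'I']
```

The resulting B/I/O labels would then be passed to the classifier as one feature among many, not used as the final answer.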

209

Feature window

Allow some/all features from adjacent tokens
to be accessed via an index (-2, -1, 0, +1, +2)
– E.g. “Token='gene'+1” would be a strong feature

210

1st order Markov property

probability of next
state only depends on current state

211

Nth-order Markov property

probability of next state only depends on the last N states. An Nth-order process can be modeled as a first-order process

212

Hidden Markov Models

Sequence length of 1 would be equivalent to
Naïve Bayes classifier
– Assumes that features are independent of each
other
– Gives equal weight to each feature

213

Conditional Random Fields

Linear-chain CRFs are similar to HMMs
• Sequence length of 1 would be equivalent to
Logistic Regression
– Does not assume features are independent of each
other
– Each feature is weighted differently
• Becoming more widely used
– For NER, more often used than HMMs

214

Why don't we evaluate a system with same data used to train it

Solution #1: Training and test sets
– Use one set specifically for training, one for
evaluation
• Solution #2: Cross-validation
– 10-fold cross-validation: Split data into 10 parts;
train on 9, test on 1; repeat for each part; average
– 5x2 cross-validation: Split data into 2 parts; train on
one, test on other; repeat 5 times; & average
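The 10-fold splitting above can be sketched with a round-robin partition (a minimal version; real toolkits also shuffle and stratify):

```python
# Split data into k parts; each part serves once as the test set
# while the remaining k-1 parts form the training set.
def k_fold_splits(data, k=10):
    folds = [data[i::k] for i in range(k)]  # round-robin into k folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(20))
splits = list(k_fold_splits(data, k=10))
print(len(splits))                            # 10 train/test pairs
print(len(splits[0][0]), len(splits[0][1]))   # 18 train, 2 test
```

The per-fold scores would then be averaged to report a single number.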

215

Feature selection

“Best” results requires tuning the feature set to
the data
– Different datasets will have a different set of best
features
• Feature selection selects the best-performing
subset
– Typically performed by iteratively dropping the
least informative features
– Typically performed manually (automated methods
do exist)

216

Evaluating individual updates requires the use of

leave-one-out evaluation
– E.g. Show AB < ABC, AC < ABC, BC < ABC
– Demonstrates configuration is at least a local
maximum

217

Biomedical Text Mining Applications

Adverse drug event extraction, relationship extraction (e.g. gene-disease relations), sentiment analysis (emotion extraction), and NER

218

Tokenization

Identifying words from a given document stream

219

Stemming

Assignment of a canonical base form

220

Porter algorithm

Builds on a list of inflectional and derivational suffixes
(morphologically recurring strings such as es, ed, ing, ion,
ly…) and iteratively matches incoming strings from right
to left, based on longest match
■ E.g. for activation, if both ion and ation match, it will choose activ rather than activat as the base form
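A crude sketch of the longest-match idea (greatly simplified; the real Porter algorithm applies staged rules with measure conditions, not a single suffix list):

```python
# Invented suffix list, tried longest-first to mimic right-to-left longest match
SUFFIXES = sorted(["ation", "ion", "ing", "ed", "es", "ly", "s"],
                  key=len, reverse=True)

def crude_stem(word):
    for suf in SUFFIXES:
        # require a minimal remaining stem so short words survive intact
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

print(crude_stem("activation"))  # 'activ': 'ation' wins over 'ion'
```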

221

Text Mining Pipeline

□ Recognizing words in text is only the first step
□ Information Retrieval deals with the problem of finding
relevant documents in response to a specific information need
(for example, publications relevant to drugs that impact IL6)
□ Information Extraction requires methods to find biomedical
entities of interest and associations among them
□ Thus, we jump from finding tokens, to finding meaningful
groups of words, and then related ideas…

222

Information Retrieval

deals with the problem of finding
relevant documents in response to a specific information need
(for example, publications relevant to drugs that impact IL6)

223

Information Extraction

requires methods to find biomedical
entities of interest and associations among them

224

NER

Extract entities of
interest (such as genes, drugs, and diseases) from
biomedical literature

A basic building block of almost all other automatic
extraction sub-tasks.
□ Deals with the problem of finding references to entities
(mentions) present in text, and tagging them with their
location and type.
□ Generally considered more difficult in biomedical domain
□ An open source NER engine, BANNER (Leaman 2007)
recognizes genes and diseases
□ LINNAEUS is available for species (Gerner 2010)

225

Normalization

Map entities to their standard IDs. Identify associations between them (such as protein-protein, gene-drug, or gene-disease)

Helps link objects of potential interest, such as genes, to
detailed information not contained in a publication (such as
the Entrez Gene identifier)
□ Gene mention normalization (GN) is particularly challenging
given the high ambiguity of gene names: they refer to
orthologous or entirely different genes, are named after
phenotypes and other biomedical terms, or they resemble
common English words.
□ Two systems employing largely dictionary-based approaches
to normalize gene names are described in (Hakenberg 2008,
2011) and (Wermter 2009).

226

Association Extraction

One of the highest level tasks still considered a purely
extraction sub-task, but key to inferring new knowledge.
□ Uses the output of prior sub-tasks to produce a list of
associations among the different entities of interest.
□ Catalysts for advances in this area have been the
Biocreative and BioNLP, i2b2 shared tasks, with teams
from around the world putting their systems to the test
against carefully annotated datasets.
□ A survey of submissions to Biocreative III (Krallinger
2011) and BioNLP (Kim 2009, 2011) convey a good
overview of current approaches.

227

Regular expression

Formal language for specifying text strings: a formula in a special language that is used for specifying a simple class of strings.
 Formally, a regular expression is an algebraic
notation for characterizing a set of strings.
 RE search requires
 a pattern that we want to search for, and
 a corpus of texts to search through.

228

*

zero or more of the preceding character

229

+

one or more of the preceding character

230

.

any (single) character

231

.*

any string of characters

232

?

optional previous character

233

$

end of line

234

^

beginning of line

235

[^]

negation: match any single character not listed in the brackets (e.g. [^A-Z] matches anything that is not an uppercase letter)

236

Which of the following matches the
regular expression /\.(edu|gov)$/
a) john.levin@asu.edu
b) dave.johnson@test.com
c) samantha.johnson@nih.gov

a and c

237

Write a regular expressions for the set of all
lower case alphabetic strings ending in a b

[a-z]+b$

238

Aliases

Use aliases to designate particular recurrent sets of characters
 \d [0-9]: digit
 \D [^\d]: non-digit
 \w [a-zA-Z0-9_]: alphanumeric
 \W [^\w]: non-alphanumeric
 \s whitespace character
 \S [^\s]: non-whitespace
 \b word boundary
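The aliases work the same way in Python's re module (example text invented for illustration):

```python
import re

text = "Call 555-0137 after 5 pm."
print(re.findall(r"\d+", text))       # ['555', '0137', '5']
print(re.findall(r"\b\w+\b", text))   # ['Call', '555', '0137', 'after', '5', 'pm']
print(re.sub(r"\s+", " ", "too   many spaces"))  # 'too many spaces'
```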

239

Which of the following matches RE
/[a-z][\.\?!]\s+[A-Z]/
a. A. B
b. c! d
c. e f
d. g. H
e. i? J

d and e

240

Statement for white space?

\s+

241

Type 1 error

FP, matching strings we shouldn't have matched

242

Type 2 error

FN, not matching things that should have matched

243

Minimize FP by

Increasing accuracy (precision)

244

Minimize FN by

Increasing coverage (recall)

245

Formalisms for capturing language

Regular expressions (compact text strings)
 Finite state automata (“dynamic” graphs)
 Regular grammars (rules)

246

FSA

FSA is a mathematical model (can be
modeled as graph or a table)
 Its job is to process input strings and accept or
reject them
 Stores only a finite amount of information
 FSA has a set of states
 A state changes in response to its input
 The rules that say how the states change
based on inputs are transitions

247

Recognition

Recognition is the process of determining if a
string should be accepted by a machine
 Or… it’s the process of determining if a string
is in the language we’re defining with the
machine
 Or… it’s the process of determining if a regular
expression matches a string
Those all amount to the same thing in the end

248

Shannon Method

Use a model to generate random sentences that are like the sentences from which the model was derived. This allows us to visualize the language model

249

How to deal with unknown words

Create an unknown word token <UNK>
 Training of <UNK> probabilities
 Create a fixed lexicon L, of size V
 From a dictionary or
 A subset of terms from the training set
 At text normalization phase, any training word not in L
changed to <UNK>
 Now we count that like a normal word
 At test time
 Use UNK counts for any word not in training
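The normalization step can be sketched as follows (toy corpus; the lexicon here is "words seen at least twice", one of several ways to fix L):

```python
from collections import Counter

train = "the cat sat on the mat the cat ran".split()
counts = Counter(train)
V = {w for w, c in counts.items() if c >= 2}  # fixed lexicon L

def normalize(tokens):
    # any word outside L becomes the <UNK> token
    return [w if w in V else "<UNK>" for w in tokens]

print(normalize(train))
print(normalize("the dog sat".split()))  # ['the', '<UNK>', '<UNK>']
```

After this pass, `<UNK>` is counted like any other word, and its probability is used at test time for unseen words.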

250

Model evaluation

How do we know if our models are any good? How
do we know if one model is better than another?
 Shannon’s game gives us an intuition:
 The generated texts from the higher order models sure
“look” better: they sound more like the text the model was
obtained from.
 The generated texts from the WSJ and Shakespeare
models look different: they look like they’re based on
different underlying models.

251

Extrinsic evaluation

Put model A into an application
 For example, a speech generator
 Evaluate the performance of the application with model A
 Put model B into the application and evaluate
 Compare performance of the application with the two models

252

Intrinsic evaluation

Train parameters of the model on a training set.
 Look at the models performance on new data: want
to know how the model performs on data it hasn’t
seen.
 What new data? A test set: a dataset which is
different than our training set, but is drawn from the
same source.
 Then we need an evaluation metric to tell us how
well our model is doing on the test set: perplexity
(there are others)

253

Comparing extrinsic and intrinsic. Pros and Cons of each

Extrinsic evaluation
 Is really time-consuming
 Can take days to run an experiment
 It is a good measure of system performance
 Intrinsic evaluation
 Faster, but a temporary solution, in order to run
experiments
 Perplexity is a poor approximation unless the test data
looks just like the training data
 Generally only useful in pilot experiments (not sufficient
to publish)
 But is helpful as a start.

254

Perplexity

The intuition behind perplexity as a measure is the notion of surprise.
 How surprised is the language model when it sees
the test set?
 Where surprise is a measure of...
 Gee, I didn’t see that coming...
 The more surprised the model is, the lower the
probability it assigned to the test set
 The higher the probability, the less surprised it was

**Minimizing perplexity is the same as maximizing
probability: the best language model is one that best
predicts an unseen test set
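Formally, perplexity is the inverse probability of the test set normalized by its length, PP(W) = P(w_1…w_N)^(-1/N). A minimal sketch, given per-token probabilities:

```python
import math

def perplexity(token_probs):
    # geometric-mean inverse probability, computed in log space for stability
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# a model that assigns each of 4 test tokens probability 0.25
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

A perplexity of 4 means the model is as "surprised" as if it were choosing uniformly among 4 words at each step; less surprise (higher assigned probability) gives lower perplexity.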

255

Minimizing perplexity is the same as

maximizing probability: the best language model is one that best
predicts an unseen test set. Lower perplexity means a better model.

256

Zipf's Law

A small number of events occur with high frequency
 A large number of events occur with low frequency

257

Laplace Smoothing

Add one to all counts so that events with a zero count get a nonzero probability. Used in document classification.
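Applied to bigrams, add-one smoothing changes the estimate to P(w | prev) = (count(prev, w) + 1) / (count(prev) + V), where V is the vocabulary size. A toy sketch:

```python
from collections import Counter

corpus = "i am sam sam i am".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)  # vocabulary size (3 here: i, am, sam)

def smoothed(prev, word):
    # add one to every bigram count; add V to the denominator so it still sums to 1
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(smoothed("i", "am"))   # seen bigram: (2 + 1) / (2 + 3) = 0.6
print(smoothed("am", "do"))  # unseen bigram still gets (0 + 1) / (2 + 3) = 0.2
```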