Results of Our Extraction


A Discussion

Weighting Scheme 1

Our first weighting scheme based 50% of an affiliation's total score on two features: its distance to the speaker's name and the length of the affiliation name. We theorized that people mention their most important or current affiliations either just before or just after introducing themselves. We also believed that a longer, more specific name made an affiliation more likely to be important. The remaining 50% of the score came from the top context words and POS tags, with those to the left weighted more heavily than those to the right in a 35:15 (7:3) split. Words to the left are weighted more heavily because they often give a better indication of the semantic structure of the sentence than those to the right.
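As a minimal sketch of how these weights could combine into one score, the Python below assumes each component has already been normalized to a value between 0 and 1. The function name, the single combined name score, and the sample candidates are hypothetical and only illustrate the 50/35/15 split.

```python
# Weights for Scheme 1 as described above; they sum to 1.0.
SCHEME_1 = {"name": 0.50, "left": 0.35, "right": 0.15}


def affiliation_score(name_score, left_score, right_score, weights=SCHEME_1):
    """Combine three component scores, each assumed to lie in [0, 1]."""
    return (
        weights["name"] * name_score
        + weights["left"] * left_score
        + weights["right"] * right_score
    )


# Hypothetical candidates: (name_score, left_score, right_score).
candidates = {
    "california teachers association": (0.9, 0.8, 0.2),
    "state": (0.3, 0.4, 0.6),
}

# Rank candidates from highest to lowest combined score.
ranked = sorted(
    candidates,
    key=lambda c: affiliation_score(*candidates[c]),
    reverse=True,
)
print(ranked[0])
```

With these made-up inputs, the longer, closer candidate ranks first because the name features carry half of the total weight.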

We tested our program on 50 hand-tagged utterances from our collection. Weighting Scheme 1 was run only with a list of affiliations containing Californian cities and the uncurated dd_organizations list. Runs with this first weighting scheme produced an average accuracy of roughly 62%.

Weighting Scheme 2

For our second weighting scheme, we increased the combined weight of name distance and name length to 60%, up from 50% in the prior scheme. The ratio between left-hand-side words/POS tags and right-hand-side ones stayed at 7:3, which gives a 28:12 split of the remaining 40%.
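The arithmetic behind both schemes can be checked with a short Python sketch. The helper name `split_weights` is hypothetical; it simply divides whatever is left after the name features in a 7:3 ratio.

```python
from fractions import Fraction


def split_weights(name_pct, left_ratio=7, right_ratio=3):
    """Return (name, left, right) weights in percent.

    The name features take name_pct; the remainder is divided between
    left and right context in the ratio left_ratio:right_ratio.
    """
    remaining = Fraction(100 - name_pct)
    total = left_ratio + right_ratio
    left = remaining * left_ratio / total
    right = remaining * right_ratio / total
    return Fraction(name_pct), left, right


print([int(w) for w in split_weights(50)])  # Scheme 1: [50, 35, 15]
print([int(w) for w in split_weights(60)])  # Scheme 2: [60, 28, 12]
```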

We tested our program on the same 50 hand-tagged utterances from our collection. Weighting Scheme 2 was run using the curated dd_organizations list as well as the list of Californian cities. The second weighting scheme produced an average accuracy of 76%, up 14 percentage points from roughly 62% under our original scheme. Some of this accuracy boost may be attributed to the curation of dd_organizations.
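One simple way to compute such an accuracy figure is sketched below in Python. It assumes one predicted affiliation and one hand tag per utterance and counts exact matches; our actual scoring may have differed, and the toy data is hypothetical.

```python
def accuracy(predicted, gold):
    """Fraction of utterances whose predicted affiliation matches the hand tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data, not our real test set.
gold = ["aclu", "california", "sacramento", "foster youth"]
predicted = ["aclu", "state", "sacramento", "foster youth"]
print(f"{accuracy(predicted, gold):.0%}")  # 3 of 4 correct
```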

Word                  Count
california            112
state                 31
health                17
education             13
finance               12
vote                  12
foster youth          9
up                    8
sacramento            8
angeles               8
private               7
board of supervisors  7
los angeles           7

This table lists the most frequent affiliations and their counts in descending order. The results are consistent with our expectations: common chapter names such as "Education" and "Health" appeared often in the utterances, and many of these chapters represent a political issue or social program that the speakers are arguing for or against. Several major California cities appear among the top affiliations, with "Sacramento" among the most frequent, since it is the capital of California and thus tends to have more government-related programs. "California" is the top affiliation overall, which is what we expected, since all of these utterances were recorded in California.


Distance to Name  Count
-1                247
6                 119
5                 112
3                 94
4                 91
7                 67
8                 62
9                 47
10                34
11                29
1                 19
16                17
14                16
13                16
12                16
18                11
17                11
21                10
15                9
36                8
368

This table shows how often each distance between an affiliation and the speaker's name occurred. This feature is important because, when we analyzed the data, we found that speakers who state their name are usually introducing themselves and are about to declare their purpose for coming. Since we weighted distance heavily, one might expect the most frequent distances to be the smallest ones. However, leaving aside the -1 entry, "6" is the most frequent distance, ahead of the shorter distances "5", "3", and "4", and the smallest distance, "1", appears much farther down the list. This suggests that the range from 3 to 11 is the sweet spot, which makes sense, as there are usually connecting words between the name and the affiliation that signal the affiliation is about to appear.
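A minimal Python sketch of how such a distance could be measured follows. It assumes whitespace tokens and assumes that -1 serves as a placeholder when no name is located; both choices, along with the function name and the sample utterance, are assumptions of the sketch rather than a description of our program.

```python
def distance_to_name(name_index, affiliation_index):
    """Token distance between the speaker's name and the affiliation.

    Returns -1 when no name was located (name_index is None); that
    sentinel is an assumption of this sketch.
    """
    if name_index is None:
        return -1
    return abs(affiliation_index - name_index)


# Hypothetical utterance, split on whitespace.
tokens = "my name is jane doe and i am with the aclu".split()
name_index = tokens.index("doe")          # last token of the name
affiliation_index = tokens.index("aclu")  # first token of the affiliation
print(distance_to_name(name_index, affiliation_index))  # 6
```

Even in this short example, the connecting words "and i am with the" push the affiliation six tokens away from the name.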


Left Word     Count
the           367
of            323
,             152
with          123
behalf        81
in            52
representing  46
california    44
for           43
from          36
and           34
a             28
'm            25
.             23
NULL          22
association   22
county        19
to            18
department    14

This table lists the most common words found in the two positions to the left of an affiliation. The "NULL" value means that the affiliation had no left word, as can happen when it appears at the beginning of the utterance. "The" is the most common word, which is plausible: when introducing a proper noun such as an affiliation, "the" is often used to mark it as a single entity. An interesting entry among the top left words is the comma, which on its own usually cannot determine whether the next word will be an affiliation. However, many speakers in the utterances list several affiliations they belong to, and the comma separates the items in such a list.
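The left-context window, including the "NULL" placeholder, could be extracted roughly as in the Python sketch below. The function name and the padding behavior (one "NULL" per missing position) are assumptions made for illustration.

```python
def left_context(tokens, start, size=2, pad="NULL"):
    """Return the `size` tokens before index `start`, padded with `pad`."""
    window = tokens[max(0, start - size):start]
    return [pad] * (size - len(window)) + window


# Hypothetical tokenized utterance.
tokens = ["aclu", "of", "california", "in", "support"]
print(left_context(tokens, 0))  # ['NULL', 'NULL']: affiliation opens the utterance
print(left_context(tokens, 2))  # ['aclu', 'of']
```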


Left POS Tag  Count
IN            636
NN            620
DT            411
,             152
JJ            146
NNS           73
VBG           61
VBP           52
CC            35
.             24
RB            24
NULL          22
TO            19
VB            18
VBZ           14
PRP$          12
PRP           10

This table shows the most common part-of-speech (POS) tags found in the two positions to the left of an affiliation. These carry some significance: prepositions ("IN"), nouns ("NN"), and determiners such as "the" ("DT") are far more common than the other tags. This is consistent with the data from the top left words table, as almost all of those words fall under one of these three categories. Finding many prepositions is what we expected, since prepositions typically precede noun phrases and all of our affiliations are nouns or proper nouns. Some of the other tags found might come from affiliations that were tagged incorrectly, which would skew the data. The list of part-of-speech tags and their abbreviations can be found here.


Right Word  Count
.           366
,           321
in          210
and         143
we          118
support     97
i           87
the         61
NULL        60
california  40
also        35
on          28
is          22
's          22
strong      19
with        19
for         18
of          18
to          17
behalf      16
opposed     14

These are the most common words found in the two positions to the right of an affiliation, in descending order. The biggest difference from the left side is that much more punctuation appears to the right of the affiliation, meaning that most of the time the sentence ended there. When we found a ".", we ignored everything after it, since the end of a sentence means that nothing further would provide relevant context. This also means that the right words usually give us little contextual information about the sentence. However, we did find the words "and" and ",", which indicate that the affiliation belongs to a list, so everything in that list should be taken into consideration and weighted equally. The word "in" is also common among both the right words and the left words, since the utterances often paired an organization directly with a city, as in "ACLU in Sacramento".
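The rule of stopping at a period can be sketched in Python as follows. The function name is hypothetical, and the choice to record the period itself and then pad the rest of the window with "NULL" is an assumption of the sketch.

```python
def right_context(tokens, end, size=2, pad="NULL"):
    """Return up to `size` tokens after index `end`, stopping at a period.

    `end` is the index of the affiliation's last token. The period itself
    is kept; anything after it is ignored and replaced by `pad`.
    """
    window = []
    for token in tokens[end + 1:end + 1 + size]:
        window.append(token)
        if token == ".":
            break
    return window + [pad] * (size - len(window))


# Hypothetical tokenized utterances.
first = "i am with the aclu . we support this bill".split()
second = "aclu in sacramento .".split()
print(right_context(first, first.index("aclu")))    # ['.', 'NULL']
print(right_context(second, second.index("aclu")))  # ['in', 'sacramento']
```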


Right POS Tag  Count
NN             447
.              371
IN             328
,              321
CC             150
PRP            131
RB             101
JJ             100
DT             94
NULL           60
NNS            56
VB             30
VBZ            27
VBG            24
POS            22
VBN            19
TO             17
TO17

This table lists the most common POS tags found in the two positions to the right of an affiliation. It is less significant than the other tables, but we include it to allow a comparison between the left and right context. As expected, the "." tag is common on the right, as is the word "in", which is represented by the tag "IN". The right POS tags are consistent with what we found in the right words, except for the "NN" category (nouns): when something other than the special words follows an affiliation, it is usually a noun, which gives little contextual information about the affiliation beyond hinting that further, lower-ranked affiliations may follow. The biggest difference between the left and right POS lists is that "DT" is far less frequent on the right (94 occurrences versus 411 on the left), which indicates that the word "the" rarely follows an affiliation. This difference could be used to help distinguish whether something is an affiliation.