Weighting Scheme 1
Our first weighting scheme based 50% of an affiliation's total score on two features: the distance from the affiliation to the speaker's name, and the length of the affiliation name. We theorized that people mention their most important or current affiliations either just before or just after introducing themselves. We also believed that a longer, more specific name makes an affiliation more likely to be important. The remaining 50% of the score is split 35:15 (7:3) between the top words and POS tags to the left of the affiliation and those to the right. Words to the left are weighted more heavily because they often give a better indication of the semantic structure of the sentence than those to the right.
We tested our program on 50 hand-tagged utterances from our collection. Weighting Scheme 1 was run only with a list of affiliations containing Californian cities and the uncurated dd_organizations list. Runs with this first weighting scheme produced an average accuracy of roughly 62%.
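As a rough illustration of how such a weighted score could be combined, here is a minimal sketch. Only the 50/35/15 split comes from the description above. The even division of the 50% between name distance and name length, the `total_score` helper, and the example feature scores are assumptions made for illustration, not the project's actual code.

```python
# Weights for Weighting Scheme 1, as fractions of the total score.
# The 50/35/15 split comes from the text. How the 50% is divided between
# name distance and name length is not stated, so the even split is an assumption.
SCHEME_1 = {
    "name_distance": 0.25,  # assumed half of the 50% name-feature share
    "name_length": 0.25,    # assumed half of the 50% name-feature share
    "left_context": 0.35,   # left words and POS tags
    "right_context": 0.15,  # right words and POS tags
}


def total_score(feature_scores, weights):
    """Combine per-feature scores (each assumed to lie in [0, 1]) into one weighted score."""
    return sum(weights[name] * feature_scores[name] for name in weights)


# Hypothetical feature scores for one candidate affiliation.
candidate = {
    "name_distance": 0.8,
    "name_length": 0.5,
    "left_context": 1.0,
    "right_context": 0.2,
}
score = total_score(candidate, SCHEME_1)
```

Because the weights sum to 1, the combined score stays in the same range as the individual feature scores, which makes candidates directly comparable.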
Weighting Scheme 2
For our second weighting scheme we increased the combined weight of name distance and name length to 60%, up from 50% in the prior scheme. The ratio between left-hand-side and right-hand-side words/POS tags stayed the same (7:3), giving a 28:12 split of the remaining 40%.
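The 35:15 and 28:12 splits both follow from applying the same 7:3 ratio to whatever share is left after the name features. A small sketch of that arithmetic (the `context_split` helper is a hypothetical name introduced here):

```python
def context_split(name_share, left_ratio=7, right_ratio=3):
    """Split the share left over after the name features between left and right context."""
    remaining = 1.0 - name_share
    total = left_ratio + right_ratio
    return remaining * left_ratio / total, remaining * right_ratio / total


scheme_1_left, scheme_1_right = context_split(0.50)  # about 0.35 and 0.15
scheme_2_left, scheme_2_right = context_split(0.60)  # about 0.28 and 0.12
```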
We tested our program on the same 50 hand-tagged utterances from our collection. Weighting Scheme 2 was run using the curated dd_organizations list as well as the list of Californian cities. The second weighting scheme produced an average accuracy of 76%, up 14 percentage points from our original scheme. Some of this accuracy boost may be attributable to the curation of dd_organizations rather than to the new weights.
Word | Count |
---|---|
california | 112 |
state | 31 |
health | 17 |
education | 13 |
finance | 12 |
vote | 12 |
foster youth | 9 |
up | 8 |
sacramento | 8 |
angeles | 8 |
private | 7 |
board of supervisors | 7 |
los angeles | 7 |
This table lists the most frequent affiliations and their counts in descending order. This is consistent with what we expected: the utterances contain many common chapter names, such as "Education" or "Health", and many of these represent a political issue or social program that the speakers are arguing for or against. Several well-known Californian cities also appear among the top affiliations. "Sacramento" is among the most frequent of them, which is plausible since it is the capital of California and thus tends to have more government-related programs. "California" is the top affiliation overall, as we expected, since all of these utterances were made in California.
Distance to Name | Count |
---|---|
-1 | 247 |
6 | 119 |
5 | 112 |
3 | 94 |
4 | 91 |
7 | 67 |
8 | 62 |
9 | 47 |
10 | 34 |
11 | 29 |
1 | 19 |
16 | 17 |
14 | 16 |
13 | 16 |
12 | 16 |
18 | 11 |
17 | 11 |
21 | 10 |
15 | 9 |
36 | 8 |
This table shows how often each distance between an affiliation and the speaker's name occurs. This feature is important: when we analysed the data, we found that people who say their name are usually introducing themselves and are about to state their purpose for coming. Since we weighted distance heavily, one might expect the most frequent distances to be the smallest ones. However, leaving aside the -1 row, "6" is the most frequent distance, ahead of "5", "3" and "4", and the smallest distance, "1", appears much farther down the list. This suggests that the range 3-11 is the sweet spot, which makes sense: there are usually connecting words between the name and the affiliation, such as conjunctions, that signal the affiliation is about to appear.
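To make the distance feature concrete, here is a minimal sketch of one way a token distance between a name and an affiliation could be measured. The helper names, the convention that adjacent spans have distance 1, and returning `None` when either phrase is missing are all assumptions. The report does not describe how its distances (including the -1 value) were computed.

```python
def find_span(tokens, phrase):
    """Return (start, end) token indices of the first occurrence of phrase, or None."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i, i + n
    return None


def distance_to_name(tokens, name, affiliation):
    """Token distance between the nearer edges of the name and the affiliation.

    Adjacent spans have distance 1. Returns None if either phrase is not
    found (an assumption). Overlapping spans are not handled.
    """
    name_span = find_span(tokens, name)
    aff_span = find_span(tokens, affiliation)
    if name_span is None or aff_span is None:
        return None
    if aff_span[0] >= name_span[1]:          # affiliation follows the name
        return aff_span[0] - name_span[1] + 1
    return name_span[0] - aff_span[1] + 1    # affiliation precedes the name


# Hypothetical utterance: two connecting words separate name and affiliation.
tokens = "jane doe with the sierra club".split()
distance = distance_to_name(tokens, ["jane", "doe"], ["sierra", "club"])
```

In this example the connecting words "with the" push the distance to 3, which illustrates why very small distances such as 1 can be rare in practice.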
Left Word | Count |
---|---|
the | 367 |
of | 323 |
, | 152 |
with | 123 |
behalf | 81 |
in | 52 |
representing | 46 |
california | 44 |
for | 43 |
from | 36 |
and | 34 |
a | 28 |
'm | 25 |
. | 23 |
NULL | 22 |
association | 22 |
county | 19 |
to | 18 |
department | 14 |
This table lists the most common words among the two words to the left of each affiliation. A 'NULL' value means that the affiliation had no word in that position, for example because it appeared at the beginning of the utterance. "The" is the most common left word, which is plausible: when introducing a proper noun such as an affiliation, "the" is often used to indicate that it is a single, specific entity. An interesting entry among the top left words is the comma, which on its own cannot usually determine whether the next word will be an affiliation. In these utterances, however, many speakers list the affiliations they belong to, and the comma separates the items of such a list.
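A left-context window with 'NULL' padding could be collected along these lines. This is a sketch under assumptions: the `left_context` helper, padding every missing slot with its own "NULL", and the example utterances are illustrative and not taken from the project.

```python
from collections import Counter


def left_context(tokens, start, width=2):
    """Return the `width` tokens before index `start`, left-padded with "NULL"."""
    window = tokens[max(0, start - width):start]
    return ["NULL"] * (width - len(window)) + window


# Hypothetical (tokens, affiliation start index) pairs.
examples = [
    ("on behalf of the sierra club".split(), 4),
    ("sierra club supports this bill".split(), 0),
]

# Tally left-context words across all examples, as in the table above.
left_word_counts = Counter()
for tokens, start in examples:
    left_word_counts.update(left_context(tokens, start))
```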
Left POS Tag | Count |
---|---|
IN | 636 |
NN | 620 |
DT | 411 |
, | 152 |
JJ | 146 |
NNS | 73 |
VBG | 61 |
VBP | 52 |
CC | 35 |
. | 24 |
RB | 24 |
NULL | 22 |
TO | 19 |
VB | 18 |
VBZ | 14 |
PRP$ | 12 |
PRP | 10 |
This table lists the most common POS tags among the two words to the left of each affiliation. The tags carry real signal: prepositions ("IN"), nouns ("NN") and determiners such as "the" ("DT") are far more common than any other part of speech. This is consistent with the left words table, where almost all of the words fall under one of these three categories. Finding many prepositions is what we expected, since prepositions introduce nouns and all of our affiliations are nouns or proper nouns. Some of the other parts of speech may come from affiliations that were tagged incorrectly, which would skew the data. The list of parts of speech and their acronyms can be found here.
Right Words | Count |
---|---|
. | 366 |
, | 321 |
in | 210 |
and | 143 |
we | 118 |
support | 97 |
i | 87 |
the | 61 |
NULL | 60 |
california | 40 |
also | 35 |
on | 28 |
is | 22 |
's | 22 |
strong | 19 |
with | 19 |
for | 18 |
of | 18 |
to | 17 |
behalf | 16 |
opposed | 14 |
These are the most common words among the two words to the right of each affiliation, in descending order. The biggest difference from the left side is that far more punctuation appears to the right of the affiliation, meaning that in most cases the sentence ended there. When we found a ".", we ignored everything after it, since the end of a sentence means that no further relevant context will follow. As a result, the right words usually do not give us much contextual information about the sentence. However, we did find "and" and ",", which indicate that the affiliation is part of a list, so every item in that list should be taken into consideration and weighted equally. The word "in" is also common in both the right words and the left words, since utterances often pair an organization directly with a city, as in "ACLU in Sacramento".
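The rule of ignoring everything after a "." can be sketched as a right-context window that stops at the first sentence-ending period. The `right_context` helper and the decision to keep the "." itself while dropping what follows are assumptions made for illustration.

```python
def right_context(tokens, end, width=2):
    """Return up to `width` tokens after index `end`, stopping at the first "."."""
    window = []
    for token in tokens[end:end + width]:
        window.append(token)
        if token == ".":
            break  # nothing past the end of the sentence is considered
    return window


# Hypothetical utterances; `end` is the index just past the affiliation.
ended = right_context("i am with the aclu . we support this bill".split(), 5)
paired = right_context("the aclu in sacramento supports this".split(), 2)
```

In the first utterance only the "." is kept and "we" is dropped, while in the second the window keeps both "in" and "sacramento".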
Right POS Tag | Count |
---|---|
NN | 447 |
. | 371 |
IN | 328 |
, | 321 |
CC | 150 |
PRP | 131 |
RB | 101 |
JJ | 100 |
DT | 94 |
NULL | 60 |
NNS | 56 |
VB | 30 |
VBZ | 27 |
VBG | 24 |
POS | 22 |
VBN | 19 |
TO | 17 |
This table lists the most common POS tags among the two words to the right of each affiliation. It is less significant than the others, but we include it so that the left and right sides can be compared. As expected, the "." punctuation is common among the right words, as is the word "in", represented by the tag "IN". The right POS tags are consistent with the right words table except for the "NN" category (nouns): when the right context is something other than these special words, it is usually a noun, which gives little contextual information about the affiliation beyond hinting that further, lower-ranked affiliations may follow. The biggest difference between the left and right POS lists is that "DT" is far less common on the right (94 occurrences versus 411 on the left), which indicates that the word "the" rarely follows an affiliation. This difference could be used to help distinguish whether something is an affiliation.