Maximum Entropy Model (MEM) for Part-Of-Speech Tagging

Artificial Intelligence/NLP

Maximum Entropy Model (MEM) for Part-Of-Speech Tagging

테리는당근을좋아해 2024. 7. 16. 22:03

Maximum Entropy Model (MEM) for Part-Of-Speech Tagging

1. Maximum Entropy Model 개요

1) Maximum Entropy Model의 원리

주어진 제약 조건을 만족하면서 정보 엔트로피를 최대화하는 분포는 선택하는 것으로, 이는 가용 정보 외에 불필요한 가정을 하지 않기 위해서이다.

2) Maximum Entropy Model (최대 엔트로피 모델)

최대 엔트로피 모델에서 조건부 확률 $P(y|x)$는 입력 $x$(문맥, 단어 등) 주어진 상황에서 출력 $y$(품사 태그)의 확률을 나타내며 수식은 아래와 같다.

$P(y|x) = \frac{1}{Z(x)}exp(\sum_{i}\lambda_{i}f_{i}(x, y))$

- $f_{i}(x,y)$는 특성 함수로, 입력 단어와 출력 품사의 조합이 얼마나 일치하는지 나타내는 이진 값이다.

- $\lambda_{i}$는 특성 함수 $f_{i}$의 가중치로, 모델 학습을 통해 결정된다.

- $Z(x)$는 정규화 상수로, 모든 가능한 품사 태그 y에 대해 확률 값이 1이 되도록 보장한다.

- $Z(x)=\sum_{y'} exp(\sum_{i}\lambda_{i}f_{i}(x,y'))$

2. Feature Function (특성 함수)

특성 함수는 입력 단어와 출력 품사 태그 간의 관계를 설명한다. 아래와 같은 형태를 가질 수 있으며, 조건이 성립하면 1, 그렇지 않으면 0을 반환한다.

단어 x가 특정 접미사를 가지고 있고, 품사 태그 y가 명사일 때
단어 x 주변에 특정 다른 단어가 위치하고 있고, 품사 태그 y가 동사일 때

1) 특성 함수가 모델에서 어떻게 동작할까

"The quick brown fox jumps"라는 단어 시퀀스를 가지고 특성 함수가 모델에서 동작하는 원리를 단계별로 보면 아래와 같다.

a. 토큰화

words = ["The", "quick", "brown", "fox", "jumps"]

b. 품사 태그 정의

tags = ["DT", "JJ", "NN", "VBZ", "NNS"]

c. Feature Function 정의

def foo_1(word, tag):
    return 1 if word == "The" and tag == "DT" else 0

def foo_2(word, tag):
    return 1 if word[0].isupper() and tag == "NNP" else 0

def foo_3(prev_word, word, tag):
    return 1 if prev_word == "The" and tag == "JJ" else 0

d. "The"에 대한 특성 값 계산

word = "The"
prev_word = None  # 첫 번째 단어이므로 이전 단어가 없음

for tag in tags:
    feature_values = {
        "foo_1": foo_1(word, tag),
        "foo_2": foo_2(word, tag),
        "foo_3": foo_3(prev_word, word, tag) if prev_word else 0,
    }
    print(f"Word: {word}, Tag: {tag}, Feature Values: {feature_values}")

출력

Word: The, Tag: DT, Feature Values: {'foo_1': 1, 'foo_2': 0, 'foo_3': 0}
Word: The, Tag: JJ, Feature Values: {'foo_1': 0, 'foo_2': 0, 'foo_3': 0}
Word: The, Tag: NN, Feature Values: {'foo_1': 0, 'foo_2': 0, 'foo_3': 0}
Word: The, Tag: VBZ, Feature Values: {'foo_1': 0, 'foo_2': 0, 'foo_3': 0}
Word: The, Tag: NNS, Feature Values: {'foo_1': 0, 'foo_2': 0, 'foo_3': 0}

e. 가중치 적용 (학습된 가중치가 아래와 같다고 가정)

weights = {
    "foo_1": 2.0,
    "foo_2": 1.0,
    "foo_3": 1.5,
}

def calculate_score(feature_values, weights):
    score = 0.0
    for feature, value in feature_values.items():
        score += value * weights[feature]
    return score

for tag in tags:
    feature_values = {
        "foo_1": foo_1(word, tag),
        "foo_2": foo_2(word, tag),
        "foo_3": foo_3(prev_word, word, tag) if prev_word else 0,
    }
    score = calculate_score(feature_values, weights)
    print(f"Word: {word}, Tag: {tag}, Score: {score}")

출력

Word: The, Tag: DT, Score: 2.0
Word: The, Tag: JJ, Score: 0.0
Word: The, Tag: NN, Score: 0.0
Word: The, Tag: VBZ, Score: 0.0
Word: The, Tag: NNS, Score: 0.0

f. 태그 선택

best_tag = None
best_score = float('-inf')

for tag in tags:
    feature_values = {
        "foo_1": foo_1(word, tag),
        "foo_2": foo_2(word, tag),
        "foo_3": foo_3(prev_word, word, tag) if prev_word else 0,
    }
    score = calculate_score(feature_values, weights)
    if score > best_score:
        best_score = score
        best_tag = tag

print(f"Best tag for word '{word}' is '{best_tag}' with score {best_score}")

출력

Best tag for word 'The' is 'DT' with score 2.0

위의 과정으로 입력 단어 "The"에 대한 품사 태그로 "DT" 출력하게되고, 각 단어에 대해 위 과정을 반복적으로 수행해서 태그 시퀀스를 출력한다.

2) 특성 함수는 어떻게 정의하면 될까

특성 함수의 정의는 모델이 데이터를 효과적으로 학습하고 예측할 수 있도록하는 중요한 요소이다. 특성 함수를 정의하기 위해서 고려해야할 사항으로 도메인 지식, 문맥 정보, 위치 저보, 결합 특성 등이 있다.

a. 도메인 지식

# 단어의 특정 철자 패턴을 가지는 경우 (동명사)
def ends_with_ing(word, tag):
    return 1 if word.endswith("ing") and tag == "VBG" else 0

# 첫 글자가 대문자인 경우 (고유명사)
def starts_with_capital(word, tag):
    return 1 if word[0].isupper() and tag == "NNP" else 0

b. 문맥 정보

# 이전 단어에 따라 품사 태그가 달라질 수 있다.
def prev_word_is_the(prev_word, tag):
    return 1 if prev_word == "the" and tag == "NN" else 0

c. 위치 정보

# 문장의 첫 단어는 주로 고유명사일 가능성이 있다.
def is_first_word(index, tag):
    return 1 if index == 0 and tag == "NNP" else 0

d. 결합 특성

# 여러 가지 단일 특성을 결합해야할 때
def complex_feature(prev_word, word, next_word, tag):
    return 1 if prev_word == "the" and next_word == "cat" and word == "quick" and tag == "JJ" else 0

3. Model Learning (모델 학습)

1) 로그 우도 함수

가중치 $\lambda_{i}$를 학습하기 위해, 주어진 데이터셋 $D={(x^{1},y^{1}), (x^{2},y^{2}), ..., (x^{N},y^{N})}$에 대한 로그 우도 함수를 최대화한다.

$ logL(\lambda)=\sum_{j=1}^{N}logP(y^{j}|x^{j})$

위의 로그 우도 함수에 $P(y|x)$를 대입하면, 로그 우도 함수는 다음과 같다.

$logL(\lambda)=\sum_{j=1}^{N}(\sum_{i}\lambda_{i}f_{i}(x^{i},y^{i})-logZ(x^{j}))$

$logL(\lambda)=\sum_{j=1}^{N}(\sum_{i}\lambda_{i}f_{i}(x^{i},y^{i})-log\sum_{y'}exp(\sum_{i}\lambda _{i}f_{i}(x^{j},y')))$

2) 최적화

로그 우도 함수를 최대화하기 위해 로그 우도 함수의 기울기를 계산하면 아래와 같다.

$\frac{\partial logL(\lambda )}{\partial \lambda_{i}} = \sum_{j=1}^{N}(f_{i}(x^{j},y^{j}) - \sum_{y'}P(y'|x^{j})f_{i}(x^{j}, y'))$

- $f_{i}(x^{j},y^{j})$ : 실제 데이터에서 관찰된 특성함수의 값

- $\sum_{y'}P(y'|x^{j})f_{i}(x^{j}, y')$: 모델이 예측한 특성 함수의 기대값

경사하강법을 통해 각 가중치 $\lambda_{i}$를 업데이트 으로 최적의 가중치를 구할 수 있다. ($\eta$는 learning rate)

$\lambda_{i} \leftarrow \lambda_{i} + \eta \frac{\partial logL(\lambda )}{\partial \lambda_{i}}$

4. 최적화 방법

반복적인 방법(예: GIS, Generalizaed Iterative Scaling 또는 IIS(Improved Iterative Scaling)을 사용해 파라미터를 업데이트 한다.

5. 장점과 단점

(+) 유연한 특성 정의가 가능하고, 다양한 정보를 통합할 수 있다.
(-) 많은 특성을 포함하는 경우, 학습이 느려질 수 있는 과적합에 위험이 있다.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

sentences = [
    [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"), ("jumps", "VBZ")],
    [("The", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("The", "DT"), ("dog", "NN"), ("runs", "VBZ")],
]

def feature_function(sentence, index):
    word = sentence[index][0]
    prev_word = "" if index == 0 else sentence[index - 1][0]
    next_word = "" if index == len(sentence) - 1 else sentence[index + 1][0]

    features = {
        'word': word,
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'prev_word': prev_word,
        'next_word': next_word,
        'is_capitalized': word[0].isupper(),
        'is_all_caps': word.isupper(),
        'is_all_lower': word.islower(),
        'prefix-1': word[0],
        'prefix-2': word[:2],
        'prefix-3': word[:3],
        'suffix-1': word[-1],
        'suffix-2': word[-2:],
        'suffix-3': word[-3:],
        'prev_word_is_the': prev_word.lower() == 'the',
    }
    return features

X = []
y = []
for sentence in sentences:
    for index in range(len(sentence)):
        X.append(feature_function(sentence, index))
        y.append(sentence[index][1])

vectorizer = DictVectorizer(sparse=False)
label_encoder = LabelEncoder()

X = vectorizer.fit_transform(X)
y = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

def predict_pos_tags(sentence):
    features = [feature_function(sentence, index) for index in range(len(sentence))]
    X_new = vectorizer.transform(features)
    y_pred = model.predict(X_new)
    return [label_encoder.inverse_transform([tag])[0] for tag in y_pred]

new_sentence = [("The", ""), ("quick", ""), ("brown", ""), ("fox", ""), ("jumps", "")]
predicted_tags = predict_pos_tags(new_sentence)
print(list(zip([word for word, _ in new_sentence], predicted_tags)))

출력

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ')]

저작자표시