from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings
from typing import List
# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.CONLL_03).downsample(0.1)
print(corpus)
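The downsample(0.1) call keeps a random 10% of the corpus, which is handy for quick experiments before committing to a full training run. Conceptually it works like the plain-Python sketch below (a toy sentence list and a hypothetical helper, not Flair's implementation):

```python
import random

def downsample(sentences, proportion=0.1, seed=42):
    """Keep a random `proportion` of the sentences (at least one)."""
    random.seed(seed)
    k = max(1, int(len(sentences) * proportion))
    return random.sample(sentences, k)

toy_corpus = [f"sentence {i}" for i in range(100)]
sample = downsample(toy_corpus, 0.1)
print(len(sample))  # 10
```

Flair applies the same idea separately to the train, dev, and test splits of the corpus.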
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)
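make_tag_dictionary scans every sentence for the given tag type and assigns each distinct tag a stable integer index; idx2item is the reverse lookup. A minimal sketch of the idea (toy BIO-tagged sentences and hypothetical helper names, not Flair's actual Dictionary class):

```python
def make_tag_dictionary(tagged_sentences):
    """Map each distinct tag to an integer index, in order of first appearance."""
    item2idx, idx2item = {}, []
    # Flair also reserves special symbols such as '<unk>'; omitted here.
    for sentence in tagged_sentences:
        for _token, tag in sentence:
            if tag not in item2idx:
                item2idx[tag] = len(idx2item)
                idx2item.append(tag)
    return item2idx, idx2item

corpus = [
    [("George", "B-PER"), ("Washington", "I-PER"), ("went", "O")],
    [("to", "O"), ("Washington", "B-LOC")],
]
item2idx, idx2item = make_tag_dictionary(corpus)
print(idx2item)  # ['B-PER', 'I-PER', 'O', 'B-LOC']
```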
# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),
    # comment in this line to use character embeddings
    # CharacterEmbeddings(),
    # comment in these lines to use flair embeddings
    # FlairEmbeddings('news-forward'),
    # FlairEmbeddings('news-backward'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
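StackedEmbeddings simply concatenates, token by token, the vectors produced by each embedding in the list, so the final dimensionality is the sum of the individual dimensionalities. In plain Python (toy vectors and a hypothetical helper, not Flair code):

```python
def stack_embeddings(embedders, token):
    """Concatenate the vectors from each embedder for one token."""
    stacked = []
    for embed in embedders:
        stacked.extend(embed(token))
    return stacked

# Two toy "embedders" with dimensions 3 and 2.
glove_like = lambda tok: [0.1, 0.2, 0.3]
char_like = lambda tok: [0.9, 0.8]

vec = stack_embeddings([glove_like, char_like], "Washington")
print(len(vec))  # 5 = 3 + 2
```

This is why adding FlairEmbeddings to the list improves accuracy at the cost of much wider input vectors.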
# 5. initialize sequence tagger
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)
# 6. initialize trainer
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
# 7. start training
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150)
# 8. plot training curves (optional)
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/taggers/example-ner/loss.tsv')
plotter.plot_weights('resources/taggers/example-ner/weights.txt')
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.AG_NEWS, 'path/to/data/folder').downsample(0.1)
# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()
# 3. make a list of word embeddings
word_embeddings = [WordEmbeddings('glove'),
                   # comment in flair embeddings for state-of-the-art results
                   # FlairEmbeddings('news-forward'),
                   # FlairEmbeddings('news-backward'),
                   ]
# 4. init document embedding by passing list of word embeddings
document_embeddings: DocumentLSTMEmbeddings = DocumentLSTMEmbeddings(word_embeddings,
                                                                     hidden_size=512,
                                                                     reproject_words=True,
                                                                     reproject_words_dimension=256)
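With reproject_words=True, a learned linear layer first projects each (stacked) word embedding to reproject_words_dimension before the LSTM consumes it, which decouples the LSTM input size from the word embedding size. The shape arithmetic, with toy numbers in plain Python (a hypothetical helper, not Flair's layer):

```python
def linear(vector, weight):
    """Apply an out_dim x in_dim weight matrix to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

in_dim, out_dim = 4, 2  # stands in for, e.g., 100 -> 256
word_vec = [1.0, 2.0, 3.0, 4.0]
weight = [[0.5, 0.0, 0.0, 0.0],
          [0.0, 0.5, 0.0, 0.0]]
projected = linear(word_vec, weight)
print(projected)  # [0.5, 1.0]
```

The document embedding is then the LSTM's final hidden state over the sequence of projected word vectors.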
# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=False)
# 6. initialize the text classifier trainer
trainer = ModelTrainer(classifier, corpus)
# 7. start the training
trainer.train('resources/taggers/ag_news',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=150)
# 8. plot training curves (optional)
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/taggers/ag_news/loss.tsv')
plotter.plot_weights('resources/taggers/ag_news/weights.txt')
After training the model, you can load it to predict the class of new sentences: simply call the model's predict() method.
from flair.data import Sentence
classifier = TextClassifier.load_from_file('resources/taggers/ag_news/final-model.pt')
# create example sentence
sentence = Sentence('France is the current world cup winner.')
# predict tags and print
classifier.predict(sentence)
print(sentence.labels)
from typing import List
from flair.data import MultiCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import FlairEmbeddings, TokenEmbeddings, StackedEmbeddings
from flair.training_utils import EvaluationMetric
# 1. get the corpora - English and German UD
corpus: MultiCorpus = NLPTaskDataFetcher.load_corpora([NLPTask.UD_ENGLISH, NLPTask.UD_GERMAN]).downsample(0.1)
# 2. what tag do we want to predict?
tag_type = 'upos'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)
# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [
    # we use multilingual Flair embeddings in this task
    FlairEmbeddings('multi-forward'),
    FlairEmbeddings('multi-backward'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
# 5. initialize sequence tagger
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)
# 6. initialize trainer
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
# 7. start training
trainer.train('resources/taggers/example-universal-pos',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              evaluation_metric=EvaluationMetric.MICRO_ACCURACY)
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/taggers/example-universal-pos/loss.tsv')
plotter.plot_weights('resources/taggers/example-universal-pos/weights.txt')
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings
from typing import List
# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.CONLL_03).downsample(0.1)
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove')
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
# 5. initialize sequence tagger
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)
# 6. initialize trainer
from flair.trainers import ModelTrainer
from flair.training_utils import EvaluationMetric
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
# 7. start training
trainer.train('resources/taggers/example-ner',
              EvaluationMetric.MICRO_F1_SCORE,
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              checkpoint=True)
# 8. stop training at any point
# 9. continue trainer at later point
from pathlib import Path
trainer = ModelTrainer.load_from_checkpoint(Path('resources/taggers/example-ner/checkpoint.pt'), 'SequenceTagger', corpus)
trainer.train('resources/taggers/example-ner',
              EvaluationMetric.MICRO_F1_SCORE,
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              checkpoint=True)
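With checkpoint=True, the trainer periodically serializes its full state (model weights, optimizer state, epoch counter) to checkpoint.pt, so training can be interrupted and resumed from where it stopped. The pattern in plain Python (pickle as a stand-in for the real serialization, hypothetical helper names):

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Serialize the full training state to disk."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    """Restore a previously saved training state."""
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
save_checkpoint(path, {"epoch": 42, "learning_rate": 0.05})

# Later (even in a new process): resume where training left off.
state = load_checkpoint(path)
print(state["epoch"])  # 42
```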
The main thing to consider when using FlairEmbeddings (and you should use them) is that they are somewhat costly to generate for large training datasets. Depending on your setup, you can set options to optimize training time. There are three questions to ask:

1. Do you have a GPU?
FlairEmbeddings (formerly called CharLMEmbeddings) are generated with a PyTorch RNN and are therefore optimized for GPUs. If you have one, you can set a large mini-batch size to take advantage of batching. If not, you may want to use a smaller language model. For English, we ship a 'fast' variant of the embeddings that loads like this: FlairEmbeddings('news-forward-fast').

2. Do the embeddings for the entire dataset fit into memory?
In the best case, all embeddings for the dataset fit into regular memory, which greatly speeds up training. If this is not the case, you must set the flag embeddings_in_memory=False in the respective trainer (i.e. ModelTrainer) to avoid memory problems. With this flag, embeddings are either (a) recomputed at each epoch, or (b) retrieved from disk if you choose to materialize them there.

3. Do you have a fast hard disk?
If you have a fast hard drive, consider materializing the embeddings to disk. You can do this by instantiating FlairEmbeddings with a cache: FlairEmbeddings('news-forward-fast', use_cache=True). This helps if the embeddings do not fit into memory. It also helps if you have no GPU and want to run repeated experiments on the same dataset, since the embeddings then only need to be computed once and are afterwards always retrieved from disk.
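The use_cache idea in point 3 is essentially disk memoization: compute each embedding once, persist it, and serve every later request from disk. A plain-Python sketch of the pattern (hypothetical class and a trivial stand-in "embedding", not the Flair cache implementation):

```python
import hashlib
import os
import pickle
import tempfile

class DiskCachedEmbedder:
    """Compute each embedding once; afterwards always serve it from disk."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.computations = 0  # counts actual (expensive) computations

    def _path(self, text):
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, key + ".pkl")

    def embed(self, text):
        path = self._path(text)
        if os.path.exists(path):  # cache hit: read from disk
            with open(path, "rb") as f:
                return pickle.load(f)
        self.computations += 1    # cache miss: compute and store
        vector = [float(len(text))]  # stand-in for an expensive RNN pass
        with open(path, "wb") as f:
            pickle.dump(vector, f)
        return vector

cache = DiskCachedEmbedder(tempfile.mkdtemp())
cache.embed("France is the current world cup winner.")
cache.embed("France is the current world cup winner.")  # served from disk
print(cache.computations)  # 1
```

This is why repeated experiments on the same dataset get cheaper after the first run: every epoch after the first hits the cache instead of the language model.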