In [None]:
# Install the third-party library
!pip install transformers

In [2]:
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel


torch.manual_seed(12046)

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [3]:
# Tokenize the text with the tokenizer
question = 'What is the capital of China?'
ids = tokenizer(question, return_tensors='pt')
ids

{'input_ids': tensor([[2061, 318, 262, 3139, 286, 2807, 30]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [4]:
# GPT-2 generates fairly poor text on its own, so raise num_beams and no_repeat_ngram_size to improve the output.
res = model.generate(**ids, max_length=100, early_stopping=True,
 num_beams=3, no_repeat_ngram_size=2)
print(res[0])
print(tokenizer.decode(res[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([ 2061, 318, 262, 3139, 286, 2807, 30, 198, 198, 464,
 3139, 318, 2807, 13, 632, 338, 262, 4387, 3773, 287,
 262, 995, 11, 290, 340, 468, 257, 3265, 286, 517,
 621, 352, 13, 20, 2997, 661, 13, 2807, 318, 530,
 286, 262, 14162, 3957, 16533, 319, 3668, 11, 351, 281,
 5079, 12396, 286, 720, 16, 13, 17, 12989, 13, 383,
 1499, 318, 1363, 284, 257, 1271, 286, 995, 12, 4871,
 11155, 11, 1390, 262, 2059, 286, 3442, 11, 14727, 11,
 262, 3999, 8581, 286, 5483, 13473, 11, 2807, 338, 2351,
 5800, 5693, 290, 262, 21865, 8581, 329, 5800, 13, 198])
What is the capital of China?

The capital is China. It's the largest economy in the world, and it has a population of more than 1.5 billion people. China is one of the fastest growing economies on Earth, with an annual GDP of $1.2 trillion. The country is home to a number of world-class universities, including the University of California, Berkeley, the Chinese Academy of Social Sciences, China's National Science Foundation and the Shanghai Academy
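
The effect of `no_repeat_ngram_size=2` can be illustrated without the model. The sketch below is a simplified stand-in for the constraint, not the library's actual implementation: the helper name `banned_next_tokens` is hypothetical, and it only computes which next tokens would repeat an n-gram that already occurs in the sequence. The token ids are the first eleven ids of the output above.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If this earlier position starts with the same prefix,
        # the token that followed it there must not be generated again.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


# Prompt ids followed by the first generated ids, ending in 3139 (" capital").
tokens = [2061, 318, 262, 3139, 286, 2807, 30, 198, 198, 464, 3139]
banned = banned_next_tokens(tokens, 2)
print(banned)
```

In the prompt, 3139 is followed by 286, so under a bigram constraint 286 cannot follow 3139 a second time. This is consistent with the output above, where the next generated id is 318 rather than 286.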

In [5]:
# Template with question-answer examples (few-shot prompt)
template = '''
Q: What is the capital of the United Kingdom?
A: London.

Q: What is the capital of France?
A: Paris.

Q: %s
A:
'''

In [6]:
print(template % question)


Q: What is the capital of the United Kingdom?
A: London.

Q: What is the capital of France?
A: Paris.

Q: What is the capital of China?
A:



In [7]:
ids2 = tokenizer(template % question, return_tensors='pt')

In [8]:
# Use the question-answer examples to get the desired result
res2 = model.generate(**ids2, max_length=100, early_stopping=True,
 num_beams=3, no_repeat_ngram_size=2)
print(tokenizer.decode(res2[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Q: What is the capital of the United Kingdom?
A: London.

Q: What is the capital of France?
A: Paris.

Q: What is the capital of China?
A:
... Beijing.
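
Because the decoded text repeats the whole prompt, it is often convenient to keep only the newly generated answer. The sketch below is one way to do that on plain strings; `extract_answer` is a hypothetical helper, and the `generated` string is a made-up example rather than real model output. It assumes the decoded text starts with the prompt and that a new `Q:` line marks the end of the answer.

```python
def extract_answer(prompt, generated):
    """Return the text generated after `prompt`, cut at the next 'Q:' line."""
    # Drop the echoed prompt when the decoded text starts with it.
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    # The model may keep going and invent further questions; stop at the first one.
    answer = completion.split("\nQ:", 1)[0]
    return answer.strip()


prompt = "Q: What is the capital of China?\nA:"
# Hypothetical continuation, for illustration only.
generated = prompt + " Beijing.\n\nQ: What is the capital of Japan?\nA: Tokyo."
answer = extract_answer(prompt, generated)
print(answer)
```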


In [9]:
# Results for Chinese input are poor
question_zh = '中国的首都在哪里?'
ids_zh = tokenizer(question_zh, return_tensors='pt')
res_zh = model.generate(**ids_zh, max_length=100, early_stopping=True,
 num_beams=3, no_repeat_ngram_size=2)
print(tokenizer.decode(res_zh[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


中国的首都在哪里?。

非常和属经啊建城的自己,那么做喜址器,没有陛下解长支提队的现�
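
GPT-2's tokenizer is a byte-level BPE, so it works on the UTF-8 bytes of the text, and each of these Chinese characters takes three bytes. The sketch below uses only the standard library to compare character and byte counts for the question. It also shows how cutting a character's bytes short yields the replacement character `�`; that this is what produced the `�` at the end of the output above is an assumption, not something the notebook confirms.

```python
question_zh = '中国的首都在哪里?'

# Characters versus UTF-8 bytes: a byte-level tokenizer sees the bytes.
n_chars = len(question_zh)
n_bytes = len(question_zh.encode('utf-8'))
print(n_chars, n_bytes)

# Dropping the last byte of a multi-byte character leaves an incomplete
# sequence, which decodes to U+FFFD when errors are replaced.
raw = '现'.encode('utf-8')
truncated = raw[:-1].decode('utf-8', errors='replace')
print(repr(truncated))
```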


In [10]:
# Even with the question-answer examples, the desired result is not obtained
ids_zh2 = tokenizer(template % question_zh, return_tensors='pt')
res_zh2 = model.generate(**ids_zh2, max_length=100, early_stopping=True,
 num_beams=3, no_repeat_ngram_size=2)
print(tokenizer.decode(res_zh2[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Q: What is the capital of the United Kingdom?
A: London.

Q: What is the capital of France?
A: Paris.

Q: 中国的首都在哪里?
A:
The capital city of China, Shanghai, is located in the middle of a vast expanse of land that stretches from the north to the south. It is also known as the "Great Wall of
