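Natural Language Processing

Module 1

1. String processing

Stripping characters from the start or end

string.strip(chars)

Python's strip() method removes the given characters from both ends of a string and returns a new string. With no argument it strips whitespace (spaces, tabs, newlines). The argument is treated as a set of characters, not as a literal prefix or suffix: every leading and trailing character that belongs to the set is removed.

Note: strip() only removes characters at the start and end of the string; it never touches the middle.

e.g.:

s = "aaa………………aaa"
print(s.strip("a"))	# the argument is a set of characters; any set that contains "a" strips the a's here
print(s)            	# strings are immutable, so the original is unchanged

Output:

………………
aaa………………aaa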

To strip only the left side use string.lstrip(chars); to strip only the right side use string.rstrip(chars).

e.g.:

s = "aaa………………aaa"
print(s.lstrip("a"))
print(s.rstrip("a"))

Output:

………………aaa
aaa………………
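One detail that is easy to miss is that the argument to strip(), lstrip() and rstrip() is a set of characters rather than a literal prefix or suffix. The sketch below uses a made-up file name to show the difference; removesuffix() assumes Python 3.9 or newer.

```python
name = "comic.com"

# rstrip() removes every trailing character found in the set {".", "c", "o", "m"},
# so it keeps going past the literal ".com" and also eats the final "c" of "comic".
stripped = name.rstrip(".com")

# removesuffix() (Python 3.9+) removes the exact substring once, if present.
exact = name.removesuffix(".com")

print(stripped)
print(exact)
```

When the goal is to drop a fixed prefix or suffix, removeprefix()/removesuffix() are usually the safer choice; strip() fits best for trimming whitespace or padding characters.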

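Replacing

string.replace(old, new[, count])

Returns a copy of the string in which occurrences of old are replaced by new. If count is given, only the first count occurrences are replaced.

e.g.:

s = "aaa…………a……aaa"
print(s.replace('a', 'b'))      # replace every "a"
print(s.replace('a', 'b', 4))   # replace only the first 4 occurrences
print(s.replace("aa", 'b'))     # non-overlapping matches, found left to right
print(s)                        # the original string is unchanged

Output: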

bbb…………b……bbb
bbb…………b……aaa
ba…………a……ba
aaa…………a……aaa

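Searching

str.find(sub[, start[, end]])

sub: the substring to search for
start: index at which the search begins, 0 by default
end: index at which the search stops, the length of the string by default

Returns the lowest index at which sub is found within the slice [start, end), or -1 if it is not found.

e.g.:

str1 = "this is string example....wow!!!"
str2 = "exam"

print(str1.find(str2))
print(str1.find(str2, 10))
print(str1.find(str2, 40))
print(str1.find(str2, 10, 40))
print(str1.find(str2, 0, 10))

Output:

15
15
-1
15
-1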

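Testing

str.isalpha()

Returns True if the string has at least one character and every character is alphabetic in the Unicode sense (this includes Chinese characters, not only A-Z); otherwise returns False.

e.g.:

s = "Jane"
print(s.isalpha())

s = "Jane简"
print(s.isalpha())

s = "Jane简!!!"
print(s.isalpha())

Output:

True
True
False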

str.isdigit()

Returns True if the string has at least one character and every character is a digit; otherwise returns False.

e.g.:

s = "1999"
print(s.isdigit())

s = "1999年"
print(s.isdigit())

s = "1999!!!"
print(s.isdigit())

Output:

True
False
False
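isdigit() follows Unicode, so it accepts more than the ASCII characters 0-9 and it rejects signs and decimal points. The sketch below compares it with isdecimal() on a few hand-picked samples; the helper name is_ascii_digits is made up for this illustration, and str.isascii() assumes Python 3.7 or newer.

```python
samples = ["1999", "１９９９", "²", "-5", "3.14", ""]

# Compare the two predicates on each sample.
for sample in samples:
    print(repr(sample), sample.isdigit(), sample.isdecimal())


def is_ascii_digits(text):
    """Hypothetical helper: True only for non-empty strings made of 0-9."""
    return text.isascii() and text.isdigit()


print(is_ascii_digits("1999"))  # plain ASCII digits
print(is_ascii_digits("²"))     # superscript two is a digit, but not ASCII
```

This matters when the string is later passed to int(): a superscript such as "²" passes isdigit() but cannot be converted.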

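Splitting

str.split(sep=None, maxsplit=-1)

sep: the delimiter. By default (None) the string is split on runs of whitespace, including spaces, newlines (\n) and tabs (\t).
maxsplit: the maximum number of splits; if given, the result has at most maxsplit+1 elements. The default, -1, means no limit.

Returns the list of substrings.

e.g.:

s = "this is string example....wow!!!"
print(s.split())          # default: split on whitespace
print(s.split('i', 1))    # split on "i", at most once
print(s.split('i', 0))    # maxsplit=0: no split at all
print(s.split('i', -1))   # maxsplit=-1: split on every "i"
print(s.split('w'))       # split on "w"

Output: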

['this', 'is', 'string', 'example....wow!!!']
['th', 's is string example....wow!!!']
['this is string example....wow!!!']
['th', 's ', 's str', 'ng example....wow!!!']
['this is string example....', 'o', '!!!']
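The default separator and an explicit " " behave differently, and maxsplit can be counted from either end. A small sketch with made-up sample strings:

```python
line = "  this  is\tstring \n"

by_whitespace = line.split()      # runs of any whitespace; no empty strings
by_space = line.split(" ")        # every single space is a delimiter

record = "2024-03-15,INFO,disk usage,82%"
first_two = record.split(",", 2)  # at most 2 splits, counted from the left
last_one = record.rsplit(",", 1)  # at most 1 split, counted from the right

print(by_whitespace)
print(by_space)
print(first_two)
print(last_one)
```

For free-form text, split() with no argument is usually what is wanted; an explicit separator is better suited to delimited records where empty fields are meaningful.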

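Joining

str.join(iterable)

iterable: the strings to join, for example a tuple or a list.

Returns a new string made of the elements joined by the separator the method is called on. Every element must be a string.

e.g.:

symbol = "-"
seq = ("a", "b", "c")       # a tuple of strings
print(symbol.join(seq))
seq = ["a", "b", "c"]       # a list of strings
print(symbol.join(seq))

Output:

a-b-c
a-b-c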

Help

help(str)

Prints the built-in documentation for the str type, including every string method.

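2. Regular expressions

In Python, regular expressions are handled by the re module.

The re module provides functions for pattern matching, searching and replacing in strings, and gives Python full regular-expression support.

Pattern	Meaning
.	Any character except a newline
^	Start of the string
$	End of the string, e.g. ^abc.*xyz$ matches a string that starts with "abc" and ends with "xyz"
[xxx]	Any one of the listed characters, e.g. a[bc]d matches abd and acd
[a-zA-Z]	Any English letter
[^a-zA-Z]	Any character that is not an English letter; ^ inside [] negates the set
|	Alternation (or)
Predefined character classes (these may also be used inside [...]):
\d	A digit; equivalent to [0-9] for ASCII text
\D	A non-digit; equivalent to [^\d]
\s	A whitespace character; equivalent to [ \t\r\n\f\v] for ASCII text
\S	A non-whitespace character; equivalent to [^\s]
\w	A word character; equivalent to [A-Za-z0-9_] for ASCII text. For str patterns in Python 3 it also matches Unicode letters and digits, such as Chinese characters
\W	A non-word character; equivalent to [^\w]
Repetition:
*	0 or more times
+	1 or more times
?	0 or 1 time
{n}	Exactly n times
{n,}	At least n times
{n,m}	From n to m times (no space after the comma)
Other:
(...)	Creates a capturing group; the matched substring can be read with group()
\	Escape character, used to match special characters literally, e.g. \. matches a real dot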

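Using \b

\b is the word-boundary anchor. It matches the empty position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string. It lets a match be pinned to the start or end of a word instead of the middle of one.

\b can appear at the start of a pattern, at the end, or both, depending on what is needed:

Matching a whole word:
- \bword\b matches "word" only when it stands as a complete word, wherever it occurs in the string.
Matching part of a word at its start or end:
- \bwo matches "wo" at the start of a word, e.g. the "wo" in "word".
- rd\b matches "rd" at the end of a word, e.g. the "rd" in "word".
No match:
- \babc\b does not match inside "abcde", because "abc" is followed by another word character there, so there is no boundary after it.

Some examples: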

import re

text = "word abcde word-of-the-day"

# Match "word" as a whole word. "-" is not a word character,
# so the "word" in "word-of-the-day" counts as well.
matches = re.findall(r'\bword\b', text)
print(matches)  # Output: ['word', 'word']

# Match "wo" at the start of a word
matches = re.findall(r'\bwo', text)
print(matches)  # Output: ['wo', 'wo']

# Match "rd" at the end of a word
matches = re.findall(r'rd\b', text)
print(matches)  # Output: ['rd', 'rd']

# "abc" is only part of "abcde", so there is no match
matches = re.findall(r'\babc\b', text)
print(matches)  # Output: []

These examples show how \b handles word-boundary matching in a regular expression.

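(?: ... )

In a regular expression, (?: ... ) is a non-capturing group. Parentheses normally create a capturing group whose content can be referenced or retrieved after the match. When parentheses are needed only to structure the pattern, a non-capturing group does the grouping without capturing.

Specifically:

( ... ) creates a capturing group.
(?: ... ) creates a non-capturing group.

With (?: ... ) the enclosed pattern is treated as a unit (for example, so that a quantifier applies to all of it), but no new group number is allocated.

Example: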

import re

text = "ababab"

# Match "ab" twice in a row, using a capturing group
pattern_with_capture_group = re.compile(r'(ab){2}')
matches_with_capture_group = pattern_with_capture_group.search(text)
print(matches_with_capture_group.group(1))  # Output: ab (the last repetition captured)

# Same match with a non-capturing group; no group 1 exists
pattern_with_non_capturing_group = re.compile(r'(?:ab){2}')
matches_with_non_capturing_group = pattern_with_non_capturing_group.search(text)
print(matches_with_non_capturing_group.group())  # Output: abab

In this example (ab) is a capturing group and (?:ab) is a non-capturing group. Both match "ab", but the first makes its content available through group(1), while the second creates no group at all.

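re.findall(pattern, string, flags=0) or pattern.findall(string[, pos[, endpos]])

Finds all non-overlapping matches of the pattern in the string and returns them as a list. If the pattern contains exactly one capturing group, the list holds the text of that group; if it contains several groups, the list holds tuples with one item per group. If nothing matches, an empty list is returned.

pattern: the regular expression.
string: the string to search.
pos: optional start position of the search, 0 by default (compiled-pattern form only).
endpos: optional end position of the search, the length of the string by default (compiled-pattern form only).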
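What findall() returns depends on how many capturing groups the pattern has. The following sketch, on a made-up sample string, shows the three cases side by side:

```python
import re

text = "alice=30, bob=25"

# No capturing group: each item is the whole match.
whole = re.findall(r'\w+=\d+', text)

# One capturing group: each item is the text of that group only.
one_group = re.findall(r'\w+=(\d+)', text)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)=(\d+)', text)

# Non-capturing groups do not change the result shape.
non_capturing = re.findall(r'(?:\w+)=(?:\d+)', text)

print(whole)
print(one_group)
print(two_groups)
print(non_capturing)
```

If parentheses are needed only for grouping and the whole match is wanted back, (?: ... ) avoids the surprise of getting group contents instead.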

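re.compile(pattern[, flags])

compile() compiles a regular expression into a Pattern object that can be reused.

pattern: the regular expression, as a string
flags: optional matching flags, such as re.IGNORECASE (ignore case) or re.MULTILINE (multi-line mode); see the re documentation for the full list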
eg1:

import re
content = "hello自然语言处理, 32edsakfdfi9r"
pattern = re.compile(r'.')      # any single character except a newline
print(re.findall(pattern, content))

Output:

['h', 'e', 'l', 'l', 'o', '自', '然', '语', '言', '处', '理', ',', ' ', '3', '2', 'e', 'd', 's', 'a', 'k', 'f', 'd', 'f', 'i', '9', 'r']

eg2:

import re
content = "hello自然语言处理, 32edsakfdfi9r"
pattern = re.compile(r'[eds]')          # any of e, d, s
print(re.findall(pattern, content))

pattern = re.compile(r'[a-zA-Z]')       # any English letter
print(re.findall(pattern, content))

pattern = re.compile(r'[0-9a-zA-Z]')    # any ASCII digit or English letter
print(re.findall(pattern, content))

pattern = re.compile(r'[^0-9a-zA-Z]')   # anything else
print(re.findall(pattern, content))

Output:

['e', 'e', 'd', 's', 'd']
['h', 'e', 'l', 'l', 'o', 'e', 'd', 's', 'a', 'k', 'f', 'd', 'f', 'i', 'r']
['h', 'e', 'l', 'l', 'o', '3', '2', 'e', 'd', 's', 'a', 'k', 'f', 'd', 'f', 'i', '9', 'r']
['自', '然', '语', '言', '处', '理', ',', ' ']

eg3:

import re
content = "hello自然语言处理, 32edsakfdfi9r"

pattern = re.compile(r'[0-9a-zA-Z]')
print(re.findall(pattern, content))

pattern = re.compile(r'[a-zA-Z]|[0-9]')		# equivalent to the pattern above
print(re.findall(pattern, content))

Output:

['h', 'e', 'l', 'l', 'o', '3', '2', 'e', 'd', 's', 'a', 'k', 'f', 'd', 'f', 'i', '9', 'r']
['h', 'e', 'l', 'l', 'o', '3', '2', 'e', 'd', 's', 'a', 'k', 'f', 'd', 'f', 'i', '9', 'r']

eg4:

import re
content = "hello自然语言处理, 32edsakfdfi9r"

pattern = re.compile(r'[\d]')
print(re.findall(pattern, content))

pattern = re.compile(r'[\D]')
print(re.findall(pattern, content))

Output:

['3', '2', '9']
['h', 'e', 'l', 'l', 'o', '自', '然', '语', '言', '处', '理', ',', ' ', 'e', 'd', 's', 'a', 'k', 'f', 'd', 'f', 'i', 'r']

import re
content = "hello自然语言处理, 32edsakfdfi9r"

pattern = re.compile(r'[\s]')
print(re.findall(pattern, content))

pattern = re.compile(r'[\S]')
print(re.findall(pattern, content))

Output:

[' ']
['h', 'e', 'l', 'l', 'o', '自', '然', '语', '言', '处', '理', ',', '3', '2', 'e', 'd', 's', 'a', 'k', 'f', 'd', 'f', 'i', '9', 'r']

import re
content = "hello自然语言处理, 32edsakfdfi9r"

pattern = re.compile(r'[\w]')
print(re.findall(pattern, content))

pattern = re.compile(r'[\W]')
print(re.findall(pattern, content))

Output:

['h', 'e', 'l', 'l', 'o', '自', '然', '语', '言', '处', '理', '3', '2', 'e', 'd', 's', 'a', 'k', 'f', 'd', 'f', 'i', '9', 'r']
[',', ' ']

eg5:

import re
content = "hello自然语言处理, 32edsakfdfi9r"

pattern = re.compile(r'\d*')        # 0 or more digits: also yields an empty match at every non-digit position
print(re.findall(pattern, content))

pattern = re.compile(r'\d+')        # 1 or more digits
print(re.findall(pattern, content))

pattern = re.compile(r'\d?')        # 0 or 1 digit
print(re.findall(pattern, content))

pattern = re.compile(r'\d{2}')      # exactly 2 digits
print(re.findall(pattern, content))

pattern = re.compile(r'\d{1,2}')    # 1 or 2 digits
print(re.findall(pattern, content))

Output:

['', '', '', '', '', '', '', '', '', '', '', '', '', '32', '', '', '', '', '', '', '', '', '', '9', '', '']
['32', '9']
['', '', '', '', '', '', '', '', '', '', '', '', '', '3', '2', '', '', '', '', '', '', '', '', '', '9', '', '']
['32']
['32', '9']

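re.match(pattern, string, flags=0)

re.match tries to match the pattern at the start of the string only. If the pattern does not match at position 0, match() returns None.

pattern: the regular expression to match
string: the string to match against
flags: flags that control how matching is done, such as case sensitivity or multi-line matching

On success re.match returns a match object; otherwise it returns None.

The matched text is read from the match object with group(num) or groups().

group(num=0): returns the whole match for num=0, or the text of group num. Several group numbers can be passed at once, in which case a tuple with the corresponding values is returned.
groups(): returns a tuple with the text of all groups, from group 1 up to the last one.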
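The difference between group() and groups() is easiest to see on a pattern with several groups. A minimal sketch, using a made-up date string:

```python
import re

m = re.match(r'(\d{4})-(\d{2})-(\d{2})', "2024-03-15 release")

whole = m.group()         # same as m.group(0): the entire match
year = m.group(1)         # text of the first group
year_day = m.group(1, 3)  # several group numbers give a tuple
parts = m.groups()        # all groups, from 1 to the last
span = m.span()           # (start, end) indices of the whole match

print(whole, year, year_day, parts, span)
```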

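Difference between re.match and re.search:

re.match only matches at the start of the string; if the start does not fit the pattern, the match fails and None is returned. re.search scans the whole string and returns the first match it finds.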

e.g.:

import re
content = "hello自然语言处理, 32edsakfdfi9r"

pattern = re.compile(r'\w*')
match = re.match(pattern, content)
if match:
    print(match.group())
else:
    print("no match")

pattern = re.compile(r'\d+')
match = re.match(pattern, content)
if match:
    print(match.group())
else:
    print("no match")

Output:

hello自然语言处理
no match

import re
content = "hello自然语言处理, 32edsakfdfi9r"

pattern = re.compile(r'\d+')
match = re.match(pattern, content)      # the string does not start with a digit
if match:
    print(match.group())
else:
    print("no match")

pattern = re.compile(r'\d+')
s = re.search(pattern, content)         # finds the first run of digits anywhere
if s:
    print(s.group())
else:
    print("no match")

Output:

no match
32

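3. Other commonly used functions in the re module

Search and replace

Python's re module provides re.sub to replace the matches of a pattern in a string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

pattern: the regular expression.
repl: the replacement string; it may also be a function that takes a match object and returns the replacement.
string: the original string to search and replace in.
count: the maximum number of replacements; the default, 0, replaces every match.
flags: matching flags, e.g. re.IGNORECASE.

The first three parameters are required; the last two are optional.

e.g.: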

import re

phone = "2004-959-559 # this is a phone number"

# Remove the comment
num = re.sub(r'#.*$', "", phone)
print("Phone number:", num)

# Remove everything that is not a digit
num = re.sub(r'\D', "", phone)
print("Phone number:", num)

Output:

Phone number: 2004-959-559
Phone number: 2004959559
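The repl argument can also be a function, and the replacement string can refer back to groups. A short sketch on a made-up sentence; the helper name double is only for this illustration:

```python
import re

text = "3 apples and 12 pears"


def double(match):
    """Replacement function: receives a match object, returns a string."""
    return str(int(match.group()) * 2)


doubled = re.sub(r'\d+', double, text)                 # function as repl
first_only = re.sub(r'\d+', "N", text, count=1)        # replace only the first match
swapped = re.sub(r'(\d+) (\w+)', r'\2 x\1', text)      # backreferences \1 and \2

print(doubled)
print(first_only)
print(swapped)
```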

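Splitting with a pattern

re.split splits the string at every match of the pattern and returns the pieces as a list:

re.split(pattern, string[, maxsplit=0, flags=0])

Parameters:

pattern: the regular expression to match
string: the string to split
maxsplit: the maximum number of splits; maxsplit=1 splits once. The default, 0, means no limit.
flags: flags that control how matching is done, such as case sensitivity or multi-line matching

e.g.:

import re
content = "hello自然语言处理, 32edsakfdfi9r"

pattern = re.compile(r'\d+')
print(re.split(pattern, content))

Output:

['hello自然语言处理, ', 'edsakfdfi', 'r']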
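Two variations on the same string are worth seeing: limiting the number of splits, and keeping the delimiters by wrapping the pattern in a capturing group.

```python
import re

content = "hello自然语言处理, 32edsakfdfi9r"

# maxsplit=1: split at the first run of digits only.
split_once = re.split(r'\d+', content, maxsplit=1)

# A capturing group makes re.split keep the matched delimiters in the result.
with_delims = re.split(r'(\d+)', content)

print(split_once)
print(with_delims)
```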

Named groups

(?P<name>pattern)

name is the name given to the group.

e.g.:

import re

content = "hello123"

pattern = re.compile(r'(?P<letters>[a-z]+)(?P<digits>\d+)')
match = re.match(pattern, content)

if match:
    # Value of the first group (the lowercase letters)
    group1 = match.group("letters")

    # Value of the second group (the digits)
    group2 = match.group("digits")

    print("Group 1 (lowercase letters):", group1)
    print("Group 2 (digits):", group2)
else:
    print("no match")

Output:

Group 1 (lowercase letters): hello
Group 2 (digits): 123
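Named groups remain reachable by number, can be collected into a dictionary, and can be referenced in a replacement string. A minimal sketch; the group names letters and digits are chosen for this illustration:

```python
import re

pattern = re.compile(r'(?P<letters>[a-z]+)(?P<digits>\d+)')
m = pattern.match("hello123")

as_dict = m.groupdict()                          # all named groups as a dict
same = m.group(1) == m.group("letters")          # named groups are numbered too
reordered = pattern.sub(r'\g<digits>\g<letters>', "hello123")  # \g<name> in repl

print(as_dict)
print(same)
print(reordered)
```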

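4. Introduction to the NLTK toolkit

A very practical, long-established text-processing toolkit, aimed mainly at English data.

Install:

pip install nltk

Open the interactive downloader:

import nltk
nltk.download()

Resources used below (recent NLTK releases may ask for differently named resources, such as punkt_tab; the error message names the one to download)

Name	Description
punkt	Tokenization
stopwords	Stop words
averaged_perceptron_tagger	Part-of-speech tagging
maxent_ne_chunker	Named entity recognition
words	Named entity recognition (word list used by the chunker)

Tokenization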

from nltk.tokenize import word_tokenize

content = "Of all the changes that have taken place in English-language newspapers during \
the past quarter-century, perhaps the most far-reaching has been the inexorable \
decline in the scope and seriousness of their arts coverage."

tokens = word_tokenize(content)             # tokenize

tokens = [word.lower() for word in tokens]  # lowercase

print(tokens)

Output:

['of', 'all', 'the', 'changes', 'that', 'have', 'taken', 'place', 'in', 'english-language', 'newspapers', 'during', 'the', 'past', 'quarter-century', ',', 'perhaps', 'the', 'most', 'far-reaching', 'has', 'been', 'the', 'inexorable', 'decline', 'in', 'the', 'scope', 'and', 'seriousness', 'of', 'their', 'arts', 'coverage', '.']

The Text object

Documentation:

import nltk.text
help(nltk.text)

Usage:

from nltk.tokenize import word_tokenize
from nltk.text import Text

content = "Of all the changes that have taken place in English-language newspapers during \
the past quarter-century, perhaps the most far-reaching has been the inexorable \
decline in the scope and seriousness of their arts coverage."

tokens = word_tokenize(content)             # tokenize

tokens = [word.lower() for word in tokens]  # lowercase

# 1. Wrap the tokens in a Text object for the operations below
t = Text(tokens)

# 2. Count occurrences
print(t.count("in"))

# 3. Index of the first occurrence
print(t.index("in"))

# 4. Plot the 8 most frequent tokens (requires matplotlib)
t.plot(8)

Output:

2
8

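5. Stop-word filtering

Stop words are common words that are ignored or removed during text processing. They are usually the most frequent words of a language and contribute little to the meaning of a text. In natural language processing (NLP) and text analysis they are often removed so that attention can go to the more informative words.

Typical stop words are common prepositions, conjunctions and pronouns, plus other high-frequency words that carry little specific meaning. In English, examples include "the", "and", "is" and "of".

In tasks such as text classification, information retrieval and text mining, removing stop words reduces noise, raises the relative weight of keywords, shrinks the feature space and speeds up processing. The exact list depends on the task and is usually tailored to it.

Many NLP libraries, including NLTK (Natural Language Toolkit) and scikit-learn, ship general-purpose stop-word lists and also allow custom ones.

The stop-word corpus

from nltk.corpus import stopwords   # import the stop-word corpus

print(stopwords.readme())

Viewing the stop words:

from nltk.corpus import stopwords   # import the stop-word corpus

print(stopwords.fileids())  # languages for which stop words are available
print(stopwords.raw("chinese").replace('\n', ' '))  # Chinese stop words as one string
print(stopwords.words("chinese"))  # Chinese stop words as a list

Output: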

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 
'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
一 一下 一些 一切 一则 一天 一定 一方面 一旦 一时 一来 一样 一次 一片 一直 一致 一般 一起 一边 一面 万一 上下 上升 上去 上来 上述 上面 下列 下去 下来
 下面 不一 不久 不仅 不会 不但 不光 不单 不变 不只 不可 不同 不够 不如 不得 不怕 不惟 不成 不拘 不敢 不断 不是 不比 不然 不特 不独 不管 不能 不要 不 
论 不足 不过 不问 与 与其 与否 与此同时 专门 且 两者 严格 严重 个 个人 个别 中小 中间 丰富 临 为 为主 为了 为什么 为什麽 为何 为着 主张 主要 举行 乃 
乃至 么 之 之一 之前 之后 之後 之所以 之类 乌乎 乎 ………………
['一', '一下', '一些', '一切', '一则', '一天', '一定', '一方面', '一旦', '一时', '一来', '一样', '一次', '一片', '一直', '一致', '一般', '一起', '一 
边', '一面', '万一', '上下', '上升', '上去', '上来', '上述', '上面', '下列', '下去', '下来', '下面', '不一', '不久', '不仅', '不会', '不但', '不光', 
'不单', '不变', '不只', '不可', '不同', '不够', '不如', '不得', '不怕', '不惟', '不成', '不拘', '不敢', '不断', '不是', '不比', '不然', '不特', '不独
', '不管', '不能', '不要', '不论', '不足', ………………

Usage

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

content = "Of all the changes that have taken place in English-language newspapers during \
the past quarter-century, perhaps the most far-reaching has been the inexorable \
decline in the scope and seriousness of their arts coverage."

tokens = word_tokenize(content)             # tokenize

tokens = [word.lower() for word in tokens]  # lowercase

tokens = set(tokens)    # a set: duplicates are dropped and order is not preserved

stop_words = set(stopwords.words("english"))    # build the stop-word set once

print(tokens.intersection(stop_words))      # tokens that are stop words

# Filter out the stop words
filtered = [w for w in tokens if w not in stop_words]

print(filtered)

Output (set order may differ between runs):

{'in', 'of', 'most', 'has', 'that', 'been', 'have', 'all', 'the', 'their', 'during', 'and'}
['arts', 'changes', 'quarter-century', 'place', 'newspapers', 'english-language', 'perhaps', '.', 'scope', 'far-reaching', 'inexorable', 'seriousness', 'taken', 'coverage', 'decline', 'past', ',']

6. Part-of-speech tagging

The table below lists common part-of-speech tags used in NLTK (the Penn Treebank tag set) and their meanings:

Tag	Meaning
CC	Coordinating conjunction
CD	Cardinal number
DT	Determiner
EX	Existential there
FW	Foreign word
IN	Preposition or subordinating conjunction
JJ	Adjective
JJR	Adjective, comparative
JJS	Adjective, superlative
LS	List item marker
MD	Modal
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
PDT	Predeterminer
POS	Possessive ending
PRP	Personal pronoun
PRP$	Possessive pronoun
RB	Adverb
RBR	Adverb, comparative
RBS	Adverb, superlative
RP	Particle
SYM	Symbol
TO	"to", as in the infinitive marker
UH	Interjection
VB	Verb, base form
VBD	Verb, past tense
VBG	Verb, gerund or present participle
VBN	Verb, past participle
VBP	Verb, non-3rd person singular present
VBZ	Verb, 3rd person singular present
WDT	Wh-determiner
WP	Wh-pronoun
WP$	Possessive wh-pronoun
WRB	Wh-adverb

These tags describe the part of speech of each token and are used in part-of-speech tagging and in downstream NLP tasks.

Usage

from nltk.tokenize import word_tokenize
from nltk import pos_tag

content = "Of all the changes that have taken place in English-language newspapers during \
the past quarter-century, perhaps the most far-reaching has been the inexorable \
decline in the scope and seriousness of their arts coverage."

tokens = word_tokenize(content)             # tokenize

tokens = [word.lower() for word in tokens]  # lowercase

print(pos_tag(tokens))  # part-of-speech tagging

Output:

[('of', 'IN'), ('all', 'PDT'), ('the', 'DT'), ('changes', 'NNS'), ('that', 'WDT'), ('have', 'VBP'), ('taken', 'VBN'), ('place', 'NN'), ('in', 'IN'), 
('english-language', 'JJ'), ('newspapers', 'NNS'), ('during', 'IN'), ('the', 'DT'), ('past', 'JJ'), ('quarter-century', 'NN'), (',', ','), ('perhaps', 'RB'), ('the', 'DT'), ('most', 'RBS'), ('far-reaching', 'JJ'), ('has', 'VBZ'), ('been', 'VBN'), ('the', 'DT'), ('inexorable', 'JJ'), ('decline', 'NN'), ('in', 'IN'), ('the', 'DT'), ('scope', 'NN'), ('and', 'CC'), ('seriousness', 'NN'), ('of', 'IN'), ('their', 'PRP$'), ('arts', 'NNS'), ('coverage', 'NN'), ('.', '.')]

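Chunking

In natural language processing, chunking groups the tokens of a text into meaningful phrases. Common applications include named entity recognition (NER) and shallow phrase-structure parsing.

NLTK provides tools for chunking, the most commonly used being the regular-expression based RegexpParser. A simple example:

import nltk

# Sample text
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize and tag parts of speech
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

# Define the chunking rule
chunk_rule = r"""Chunk: {<DT>?<JJ>*<NN.*>+}"""

# Create the chunker
chunk_parser = nltk.RegexpParser(chunk_rule)

# Apply the chunker
tree = chunk_parser.parse(pos_tags)

# Print the result
print(tree)
tree.draw()     # opens a window that draws the tree (uses Tkinter)

The text is first tagged with parts of speech, then a simple chunking rule is defined. The rule is a regular expression over tags that describes the phrases of interest: a phrase named "Chunk" consisting of an optional determiner (DT), zero or more adjectives (JJ), and one or more tokens whose tag starts with NN (NN, NNS, NNP, NNPS).

A RegexpParser instance then applies the rule to the tagged tokens and produces a tree that carries the chunk structure.

Chunking rules can be tailored to the task and the text. Large corpora usually call for more elaborate rules and techniques to capture the wanted phrases accurately.

Output: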

(S
  (Chunk The/DT quick/JJ brown/NN fox/NN)
  jumps/VBZ
  over/IN
  (Chunk the/DT lazy/JJ dog/NN)
  ./.)

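Named entity recognition

A named entity is a word or phrase in a text that refers to a specific, named thing, such as a person, place, organization, date, time, percentage or amount of money. Named entity recognition (NER) is an important NLP task whose goal is to find these entities in text and classify them.

NER assigns each entity to a predefined category, typically including:

Person: names of individuals, such as "John Smith".
Location: names of geographic places, such as "New York City".
Organization: names of organizations, companies and groups, such as "Microsoft".
Date: expressions that denote a date, such as "January 1, 2022".
Time: expressions that denote a time, such as "3:30 PM".
Percentage: expressions that denote a percentage, such as "50%".
Money: expressions that denote an amount of money, such as "$100".
Custom categories: other entity types defined for the task, such as product names or technical terms.

NER is widely used in information extraction, question answering, machine translation and automatic summarization. Recognizing the named entities in a text helps a program understand its content and supports more complex NLP tasks.

Example:

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Sample text
sentence = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne on April 1, 1976."

# Tokenize and tag parts of speech
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

# Named entity recognition
ner_tree = ne_chunk(pos_tags)

# Print the result
print(ner_tree)

Output: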

(S
  (PERSON Apple/NNP)
  (ORGANIZATION Inc./NNP)
  was/VBD
  founded/VBN
  by/IN
  (PERSON Steve/NNP Jobs/NNP)
  ,/,
  (PERSON Steve/NNP Wozniak/NNP)
  ,/,
  and/CC
  (PERSON Ronald/NNP Wayne/NNP)
  on/IN
  April/NNP
  1/CD
  ,/,
  1976/CD
  ./.)

7. A data-cleaning example

The script below shows how to clean text data with the re module and NLTK. It covers removing HTML tags, irrelevant characters, hyperlinks, stop words and abbreviations, collapsing extra whitespace, and tokenization. Note that, in the order used here, steps 3 and 6 find nothing to remove on this input: step 1 has already stripped every tag, and step 2 deletes the period that etc\. needs, so "etc" survives in the final output.

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text
raw_text = """
<p>This is an example text with <a href="https://example.com">hyperlink</a>.</p>
The content includes some stop words and abbreviations like <i>etc.</i> and <abbr title="NLP">Natural Language Processing</abbr>.
"""

# 1. Remove HTML tags
clean_text = re.sub(r'<.*?>', '', raw_text)
print("After removing HTML tags:")
print(clean_text)

# 2. Remove irrelevant characters (everything except letters and whitespace)
clean_text = re.sub(r'[^a-zA-Z\s]', '', clean_text)
print("\nAfter removing irrelevant characters:")
print(clean_text)

# 3. Remove hyperlinks (no effect here: step 1 already removed all tags)
clean_text = re.sub(r'<a.*?>|</a>', '', clean_text)
print("\nAfter removing hyperlinks:")
print(clean_text)

# 4. Tokenize
tokenized_text = word_tokenize(clean_text)
print("\nTokens:")
print(tokenized_text)

# 5. Remove stop words
stop_words = set(stopwords.words('english'))
filtered_text = [word for word in tokenized_text if word.lower() not in stop_words]
print("\nAfter removing stop words:")
print(filtered_text)

# 6. Remove abbreviations (no effect here: the period was removed in step 2
#    and "NLP" only appeared inside a tag attribute)
cleaned_text = re.sub(r'\b(?:etc\.|NLP)\b', '', ' '.join(filtered_text))
print("\nAfter removing abbreviations:")
print(cleaned_text)

# 7. Collapse extra whitespace
cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
print("\nAfter collapsing whitespace:")
print(cleaned_text)

# Final sentence
final_sentence = cleaned_text
print("\nFinal sentence:")
print(final_sentence)

Output:

After removing HTML tags:

This is an example text with hyperlink.
The content includes some stop words and abbreviations like etc. and Natural Language Processing.


After removing irrelevant characters:

This is an example text with hyperlink
The content includes some stop words and abbreviations like etc and Natural Language Processing


After removing hyperlinks:

This is an example text with hyperlink
The content includes some stop words and abbreviations like etc and Natural Language Processing


Tokens:
['This', 'is', 'an', 'example', 'text', 'with', 'hyperlink', 'The', 'content', 'includes', 'some', 'stop', 'words', 'and', 'abbreviations', 'like', 'etc', 'and', 'Natural', 'Language', 'Processing']

After removing stop words:
['example', 'text', 'hyperlink', 'content', 'includes', 'stop', 'words', 'abbreviations', 'like', 'etc', 'Natural', 'Language', 'Processing']

After removing abbreviations:
example text hyperlink content includes stop words abbreviations like etc Natural Language Processing

After collapsing whitespace:
example text hyperlink content includes stop words abbreviations like etc Natural Language Processing

Final sentence:
example text hyperlink content includes stop words abbreviations like etc Natural Language Processing
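The order of the steps matters: an abbreviation such as "etc." can only be matched while its period is still in the text. The following is a rough sketch of one possible reordering, not a definitive pipeline. It uses only the re module, splits on whitespace instead of calling word_tokenize, and uses a deliberately tiny, hypothetical stop-word set (STOP_WORDS) in place of NLTK's list.

```python
import re

raw_text = """
<p>This is an example text with <a href="https://example.com">hyperlink</a>.</p>
The content includes some stop words and abbreviations like <i>etc.</i> and <abbr title="NLP">Natural Language Processing</abbr>.
"""

# Hypothetical, deliberately small stop list for this sketch only.
STOP_WORDS = {"this", "is", "an", "with", "the", "some", "and"}

# 1. Remove tags first; replace them with a space so words do not get glued together.
text = re.sub(r'<.*?>', ' ', raw_text)

# 2. Remove abbreviations while the punctuation is still present.
text = re.sub(r'\b(?:etc\.|NLP\b)', ' ', text)

# 3. Now drop everything that is not a letter or whitespace.
text = re.sub(r'[^a-zA-Z\s]', ' ', text)

# 4. Split on whitespace (this also collapses extra spaces) and filter stop words.
tokens = [word for word in text.split() if word.lower() not in STOP_WORDS]

cleaned = ' '.join(tokens)
print(cleaned)
```

With this ordering "etc." is removed, and no separate hyperlink step is needed because step 1 already strips the anchor tags.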

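Explaining the regular expressions

<.*?>

Take a string such as <b>bold</b>. The pattern <.*> matches the entire string, from the first < to the last >.
The pattern <.*?>, in contrast, matches the shortest span from a < to the next >, so it matches the tags <b> and </b> separately and leaves "bold" behind once they are removed.

In a regular expression, ? on its own means that the preceding character or group is optional, i.e. it occurs 0 or 1 time. In the combination *?, however, it switches the quantifier to non-greedy matching, which matches as little as possible.

In detail:

* matches the preceding item zero or more times (greedy: as many as possible).
*? matches the preceding item zero or more times, but as few as possible (non-greedy).

In <.*?>:

< matches a left angle bracket.
.*? matches zero or more of any character (except a newline), as few as possible.
> matches a right angle bracket.

This pattern is mainly used to match HTML or XML tags themselves, stopping at the first right angle bracket > it meets. Without the ?, .* would greedily run to the last > on the line, so one match would span several tags, which is usually not what is wanted. The ? keeps each match as short as possible.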
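The greedy and non-greedy forms can be compared directly on a small, made-up HTML fragment. The last pattern, <[^>]*>, is a common alternative that is added here for comparison and does not appear in the original script.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r'<.*>', html)      # runs from the first "<" to the last ">"
lazy = re.findall(r'<.*?>', html)       # stops at the first ">" each time
negated = re.findall(r'<[^>]*>', html)  # same result without relying on laziness
text_only = re.sub(r'<.*?>', '', html)  # strip the tags, keep the text

print(greedy)
print(lazy)
print(negated)
print(text_only)
```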

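\b(?:etc\.|NLP)\b

The regular expression \b(?:etc\.|NLP)\b wraps two alternatives, (?:etc\.|NLP), in word boundaries:

\b: a word boundary.
(?: ... ): a non-capturing group, which treats its content as a unit without creating a capturing group.
etc\.: matches "etc.", where \. matches a literal period.
|: alternation, i.e. match either the pattern before it or the one after it.
NLP: matches "NLP".

The intent is to match "etc." or "NLP" only as whole words, not as parts of longer words; the \b anchors are there to prevent matches inside other words.

Applied to the text "NLP is important, etc.", the pattern matches the whole word "NLP". The etc\. branch, however, does not match here: \b needs a word character on one side, and a period followed by a space or by the end of the string has none. That branch only matches when a word character follows the period directly, as in "etc.and". To match "etc." reliably, drop the trailing \b for that branch, for example \b(?:etc\.|NLP\b).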
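The behaviour of the trailing \b after a period can be checked directly. In the sketch below, original is the pattern from the script and adjusted is one possible correction (an assumption of this example, not part of the original notes).

```python
import re

original = re.compile(r'\b(?:etc\.|NLP)\b')
adjusted = re.compile(r'\b(?:etc\.|NLP\b)')   # trailing \b kept only for "NLP"

sentence = "NLP is important, etc."

found_original = original.findall(sentence)   # "etc." is missed: no boundary after "."
found_adjusted = adjusted.findall(sentence)   # both abbreviations are found
glued = original.findall("etc.and more")      # matches only because "a" follows the "."
inside_word = adjusted.findall("NLPX")        # still no match inside a longer word

print(found_original)
print(found_adjusted)
print(glued)
print(inside_word)
```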

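8. The spaCy toolkit

Official website: https://spacy.io

Install spaCy and the en_core_web_sm model first (pip install spacy, then python -m spacy download en_core_web_sm). Anaconda Prompt needs to be opened as administrator.

Text processing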

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Apply the model to the text
text = "Weather is good, very windy and sunny. We have no classes in the afternoon."
doc = nlp(text)

# Tokens
print("Tokens:")
for token in doc:
    print(token)

# Sentences
print("Sentences:")
for sentence in doc.sents:
    print(sentence)

Output:

Tokens:
Weather
is
good
,
very
windy
and
sunny
.
We
have
no
classes
in
the
afternoon
.
Sentences:
Weather is good, very windy and sunny.
We have no classes in the afternoon.

Part of speech

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Apply the model to the text
text = "Weather is good, very windy and sunny. We have no classes in the afternoon."
doc = nlp(text)

# Part of speech of each token
for token in doc:
    print("{}——{}".format(token, token.pos_))

Output:

Weather——NOUN
is——AUX
good——ADJ
,——PUNCT
very——ADV
windy——ADJ
and——CCONJ
sunny——ADJ
.——PUNCT
We——PRON
have——VERB
no——DET
classes——NOUN
in——ADP
the——DET
afternoon——NOUN
.——PUNCT

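Named entity recognition

Some common spaCy named-entity labels and the entity types they stand for. These are only examples; spaCy recognizes other entity types as well.

Label	Description	Examples
PERSON	People's names	"John", "Alice"
ORG	Organizations	"Google", "Microsoft"
GPE	Geopolitical entities (countries, cities, states)	"New York", "France"
DATE	Dates	"2022-01-01", "yesterday"
TIME	Times	"12:30 PM", "3 hours"
MONEY	Monetary values	"$10", "€20"
PERCENT	Percentages	"50%", "100%"
CARDINAL	Cardinal numbers	"one", "2", "three"
ORDINAL	Ordinal numbers	"first", "second"

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Apply the model to the text
text = "I went to Paris where I met my old friend Jack from uni."
doc = nlp(text)

# Named entity recognition
for ent in doc.ents:
    print("{}——{}".format(ent, ent.label_))

Output: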

Paris——GPE
Jack——PERSON

Highlighted display in Jupyter (screenshot not included)

9. Matching named entities

Finding all person names in a book

import spacy
from collections import Counter

# Load the small English model
nlp = spacy.load("en_core_web_sm")

def read_file(file_name):
    # Project Gutenberg text files are UTF-8; state the encoding explicitly
    with open(file_name, 'r', encoding='utf-8') as file:
        return file.read()

# Load the text
text = read_file("./data/Computers—the machines we think with by Daniel S. Halacy.txt")
processed_text = nlp(text)

# Extract the sentences
sentences = [s for s in processed_text.sents]
print(len(sentences))
print(sentences[:3])

# Find the 10 most frequent person names
def find_person(doc):
    c = Counter()
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            c[ent.lemma_] += 1
    return c.most_common(10)
print(find_person(processed_text))

Output:

3759
[The Project Gutenberg eBook of Computers—the machines we think with

This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever., You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org., If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

]
[('Project Gutenberg', 54), ('Jacquard', 14), ('Skinner', 12), ('Richardson', 10), ('John', 8), ('Gutenberg', 8), ('Remington Rand', 7), ('Hollerith', 6), 
('Darwin', 6), ('Greeks', 5)]

10. Terrorist-attack analysis

11. Statistical analysis results

12. The jieba tokenizer

Install:

pip install jieba

Tokenization

import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full mode: " + "/".join(seg_list))

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Accurate mode: " + "/".join(seg_list))

seg_list = jieba.cut("我来到北京清华大学")
print("Default (accurate) mode: " + "/".join(seg_list))

Output:

Full mode: 我/来到/北京/清华/清华大学/华大/大学
Accurate mode: 我/来到/北京/清华大学
Default (accurate) mode: 我/来到/北京/清华大学

Adding a custom dictionary

jieba does not recognize some words:

import jieba

text = "故宫的著名景点包括乾清宫、太和殿和黄琉璃瓦等"

seg_list = jieba.cut(text, cut_all=True)
print("Full mode: " + "/".join(seg_list))

seg_list = jieba.cut(text, cut_all=False)
print("Accurate mode: " + "/".join(seg_list))

Output:

Full mode: 故宫/的/著名/著名景点/景点/包括/乾/清宫/、/太和/太和殿/和/黄/琉璃/琉璃瓦/等
Accurate mode: 故宫/的/著名景点/包括/乾/清宫/、/太和殿/和/黄/琉璃瓦/等

A custom dictionary solves this:

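Dictionary format

The file must be saved as a UTF-8 .txt file with one entry per line. Each line has up to three parts separated by spaces, in this order: the word, its frequency (optional), and its part-of-speech tag (optional).

For example:

人工智能 1 n

人工智能 is the word itself.
1 is the word's frequency.
n is the part-of-speech tag, here a noun.

The frequency is optional. When it is omitted, jieba computes a frequency that is high enough for the word to be segmented as a unit. When several segmentations are possible, a higher frequency makes it more likely that the word is kept whole. Without a frequency the entry becomes:

人工智能 n

jieba uses its own set of part-of-speech tags. Some common ones:

n (noun): e.g. "人工智能".
v (verb): e.g. "学习".
a (adjective): e.g. "聪明".
d (adverb): e.g. "很" in "很快".
m (numeral): e.g. "三" in "三个人".
q (classifier, measure word): e.g. "本" in "几本书".
r (pronoun): e.g. "我", "你".
p (preposition): e.g. "在", "从".
c (conjunction): e.g. "和", "或".
u (particle): e.g. "的", "了".
xc (other function words): a catch-all for function words not covered above.

Tags can be refined with further letters: "n" is a noun, "nr" a person name, "ns" a place name, and so on.

In a custom dictionary these tags state the part of speech of each entry. For example:

人工智能 n
学习 v
聪明 a
很 d

Here "人工智能" is tagged as a noun (n), "学习" as a verb (v), "聪明" as an adjective (a), and "很" as an adverb (d).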

e.g.:

import jieba

jieba.load_userdict("./data/my_dist.txt")   # the file must be saved as UTF-8

text = "故宫的著名景点包括乾清宫、太和殿和黄琉璃瓦等"

seg_list = jieba.cut(text, cut_all=True)
print("Full mode: " + "/".join(seg_list))

seg_list = jieba.cut(text, cut_all=False)
print("Accurate mode: " + "/".join(seg_list))

Output:

Full mode: 故宫/的/著名/著名景点/景点/包括/乾清宫/清宫/、/太和/太和殿/和/黄琉璃瓦/琉璃/琉璃瓦/等
Accurate mode: 故宫/的/著名景点/包括/乾清宫/、/太和殿/和/黄琉璃瓦/等

The content of my_dist.txt is:

乾清宫 n
黄琉璃瓦 n

Keyword extraction

import jieba
from jieba import analyse

jieba.load_userdict("./data/my_dist.txt")   # the file must be saved as UTF-8

text = "故宫的著名景点包括乾清宫、太和殿和黄琉璃瓦等"

# Extract the top 5 keywords (extract_tags ranks words by TF-IDF)
tags = analyse.extract_tags(text, topK=5)
print("Keywords: " + ' '.join(tags))

# Print the keywords together with their weights
tags = analyse.extract_tags(text, topK=5, withWeight=True)
for word, weight in tags:
    print(word, weight)

Output:

Keywords: 著名景点 乾清宫 黄琉璃瓦 太和殿 故宫
著名景点 2.3167796086666668
乾清宫 1.9924612504833332
黄琉璃瓦 1.9924612504833332
太和殿 1.6938346722833335
故宫 1.5411195503033335

Part-of-speech tagging

import jieba.posseg as pseg

words = pseg.cut("我爱北京天安门")
for word, flag in words:
    print("%s %s" % (word, flag))

Output:

我 r
爱 v
北京 ns
天安门 ns

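Module 2

Starting Jupyter with a raised data-rate limit

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

Module 3

1. Overview of the Bayes algorithm

About Bayes
Thomas Bayes (c. 1701-1761), English mathematician
The Bayesian method originates from an essay he wrote to solve an "inverse probability" problem

Motivating example:

Suppose 60% of the students are boys and 40% are girls. Boys always wear trousers; half of the girls wear trousers and the other half wear skirts.

Forward probability: pick a student at random. What is the probability that the student wears trousers, and what is the probability that the student wears a skirt?

Inverse probability: a student wearing trousers walks toward you. You can only see that the student wears trousers and cannot tell the gender. What is the probability that the student is a girl?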
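Both questions can be worked out from the numbers given above. The sketch below applies the law of total probability for the forward question and Bayes' rule, P(girl | trousers) = P(trousers | girl) P(girl) / P(trousers), for the inverse one; the variable names are chosen for this illustration.

```python
from fractions import Fraction

# Given in the example
p_boy = Fraction(6, 10)
p_girl = Fraction(4, 10)
p_trousers_given_boy = Fraction(1)       # boys always wear trousers
p_trousers_given_girl = Fraction(1, 2)   # half of the girls wear trousers

# Forward probability: law of total probability
p_trousers = p_boy * p_trousers_given_boy + p_girl * p_trousers_given_girl
p_skirt = 1 - p_trousers

# Inverse probability: Bayes' rule
p_girl_given_trousers = p_girl * p_trousers_given_girl / p_trousers

print("P(trousers) =", p_trousers)
print("P(skirt) =", p_skirt)
print("P(girl | trousers) =", p_girl_given_trousers)
```

So a randomly chosen student wears trousers with probability 4/5 and a skirt with probability 1/5, and a student seen in trousers is a girl with probability 1/4.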
