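Mastering Python Regular Expressions: From Basics to Advanced Applications

Regular expressions are a powerful tool for processing text, and mastering them can make text-processing work far more efficient. Python provides full regular expression support through the re module. In this article, I will take you from the basics to advanced usage of regular expressions in Python.

Regular Expression Basics

What Is a Regular Expression?

A regular expression (regex for short) is a special syntax for describing string patterns. It can be used to search, replace, and validate text.

The re Module in Python

Python's re module provides the interface for working with regular expressions: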

import re

# Search for the first match anywhere in the string
result = re.search(r'pattern', 'string to search')

# Match only at the beginning of the string
result = re.match(r'pattern', 'string to match')

# Find all non-overlapping matches
results = re.findall(r'pattern', 'string to find all matches')

# Replace matches
new_string = re.sub(r'pattern', 'replacement', 'string to modify')

# Split the string on matches
parts = re.split(r'pattern', 'string to split')
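The difference between the matching functions is easy to miss, so here is a minimal sketch comparing re.search, re.match, and re.fullmatch on one sample string. The sample text and variable names are made up for illustration.

```python
import re

text = "I love Python"

# re.search scans the whole string for the first match.
found_anywhere = re.search(r'Python', text)

# re.match only succeeds if the pattern matches at position 0.
found_at_start = re.match(r'Python', text)

# re.fullmatch requires the pattern to cover the entire string.
covers_all = re.fullmatch(r'I love \w+', text)

print(found_anywhere.span())  # (7, 13)
print(found_at_start)         # None
print(covers_all.group())     # I love Python
```

For validation tasks, re.fullmatch is usually the safest choice, because a partial match cannot slip through.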

Basic Pattern Matching

The simplest regular expression matches literal characters:

import re

# Search for the word "Python"
result = re.search(r'Python', 'I love Python programming')
print(result)  # <re.Match object; span=(7, 13), match='Python'>

# Get the matched string
print(result.group())  # Python

# Get the start and end positions of the match
print(result.start(), result.end())  # 7 13

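Metacharacters and Special Sequences

The real power of regular expressions comes from metacharacters and special sequences.

Common Metacharacters

Metacharacter	Description
`.`	Matches any character except a newline (any character at all with re.DOTALL)
`^`	Matches the start of the string (and the start of each line with re.MULTILINE)
`$`	Matches the end of the string or just before a trailing newline (and the end of each line with re.MULTILINE)
`*`	Matches the preceding pattern zero or more times
`+`	Matches the preceding pattern one or more times
`?`	Matches the preceding pattern zero or one time
`{n}`	Matches the preceding pattern exactly n times
`{n,}`	Matches the preceding pattern at least n times
`{n,m}`	Matches the preceding pattern between n and m times
`\`	Escape character
`[]`	Character set; matches any one character inside the brackets
`|`	Alternation; matches the pattern before or after the `|`
`()`	Grouping

Special Sequences

Special sequence	Description
`\d`	Matches any decimal digit; equivalent to `[0-9]` in ASCII mode
`\D`	Matches any non-digit character; equivalent to `[^0-9]` in ASCII mode
`\s`	Matches any whitespace character; equivalent to `[ \t\n\r\f\v]` in ASCII mode
`\S`	Matches any non-whitespace character; equivalent to `[^ \t\n\r\f\v]` in ASCII mode
`\w`	Matches any word character (letter, digit, or underscore); equivalent to `[a-zA-Z0-9_]` in ASCII mode
`\W`	Matches any non-word character; equivalent to `[^a-zA-Z0-9_]` in ASCII mode
`\b`	Matches a word boundary
`\B`	Matches a non-word boundary

Note: for str patterns in Python 3 these classes are Unicode-aware by default, so `\d`, `\s`, and `\w` also match non-ASCII digits, whitespace, and letters. The bracket equivalents above hold exactly only when the re.ASCII flag is set.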
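To illustrate how Unicode matching and ASCII matching differ for these sequences, here is a small sketch. The sample strings are invented for the demonstration, and non-ASCII characters are written as escapes so the snippet survives copy and paste.

```python
import re

# "123" followed by the Arabic-Indic digits one, two, three
mixed_digits = "123 \u0661\u0662\u0663"

# By default \d matches any Unicode decimal digit.
unicode_digits = re.findall(r'\d+', mixed_digits)
# With re.ASCII, \d is limited to [0-9].
ascii_digits = re.findall(r'\d+', mixed_digits, re.ASCII)

accented = "caf\u00e9 na\u00efve"  # "café naïve"

# By default \w matches accented letters too.
unicode_words = re.findall(r'\w+', accented)
# With re.ASCII, the accented letters split the words.
ascii_words = re.findall(r'\w+', accented, re.ASCII)

print(len(unicode_digits), ascii_digits)  # 2 ['123']
print(ascii_words)                        # ['caf', 'na', 've']
```

If your input must be plain ASCII (for example, numeric IDs), passing re.ASCII or writing `[0-9]` explicitly avoids surprises.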

Worked Examples

import re

text = "Python 3.9 was released on 2020-10-05, Python 3.10 on 2021-10-04."

# Match every single digit
digits = re.findall(r'\d', text)
print(digits)  # ['3', '9', '2', '0', '2', '0', '1', '0', '0', '5', '3', '1', '0', '2', '0', '2', '1', '1', '0', '0', '4']

# Match every run of digits
numbers = re.findall(r'\d+', text)
print(numbers)  # ['3', '9', '2020', '10', '05', '3', '10', '2021', '10', '04']

# Match dates in YYYY-MM-DD format
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates)  # ['2020-10-05', '2021-10-04']

# Match Python versions
versions = re.findall(r'Python \d\.\d+', text)
print(versions)  # ['Python 3.9', 'Python 3.10']

Character Sets and Ranges

A character set lets you specify a group of characters, any one of which may match.

Basic Character Sets

import re

text = "The quick brown fox jumps over the lazy dog."

# Match lowercase vowels
vowels = re.findall(r'[aeiou]', text)
print(vowels)  # ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'e', 'a', 'o']

# Match lowercase consonants
consonants = re.findall(r'[bcdfghjklmnpqrstvwxyz]', text)
print(len(consonants))  # 23

Ranges

# Match all lowercase letters
lowercase = re.findall(r'[a-z]', text)
print(len(lowercase))  # 34

# Match all uppercase letters
uppercase = re.findall(r'[A-Z]', text)
print(uppercase)  # ['T']

# Match all letters and digits
alphanumeric = re.findall(r'[a-zA-Z0-9]', text)
print(len(alphanumeric))  # 35

Negated Character Sets

# Match non-whitespace characters
non_whitespace = re.findall(r'[^\s]', text)
print(len(non_whitespace))  # 36

# Match non-letter characters
non_alpha = re.findall(r'[^a-zA-Z]', text)
print(non_alpha)  # [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']
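Inside a character set, several metacharacters behave differently than they do elsewhere. The following sketch, using a made-up sample string, shows how to match a literal hyphen, caret, closing bracket, and dot.

```python
import re

sample = "a-b^c]d.e"

# A hyphen placed first (or last) in the set is literal, not a range.
hyphen_or_dot = re.findall(r'[-.]', sample)

# A caret that is not the first character of the set is literal.
a_or_caret = re.findall(r'[a^]', sample)

# A closing bracket must be escaped to be treated as a literal.
closing_bracket = re.findall(r'[\]]', sample)

# A dot inside a set matches only a literal dot.
literal_dot = re.findall(r'[.]', sample)

print(hyphen_or_dot)    # ['-', '.']
print(a_or_caret)       # ['a', '^']
print(closing_bracket)  # [']']
print(literal_dot)      # ['.']
```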

Grouping and Capturing

Grouping lets you treat part of a regular expression as a single unit and capture the substring it matches.

Basic Groups

import re

text = "Python 3.9 was released on 2020-10-05"

# Capture the version number
match = re.search(r'Python (\d\.\d+)', text)
if match:
    version = match.group(1)
    print(version)  # 3.9

Named Groups

# Use named groups
match = re.search(r'Python (?P<version>\d\.\d+) was released on (?P<date>\d{4}-\d{2}-\d{2})', text)
if match:
    version = match.group('version')
    date = match.group('date')
    print(f"Version: {version}, Release date: {date}")
    # Version: 3.9, Release date: 2020-10-05
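Named groups are also available as a dictionary and inside replacement strings. Here is a short sketch, with group names (year, month, day) chosen only for illustration.

```python
import re

text = "Python 3.9 was released on 2020-10-05"
date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# groupdict() returns all named groups as a dictionary.
parts = date_pattern.search(text).groupdict()
print(parts)  # {'year': '2020', 'month': '10', 'day': '05'}

# In a replacement string, \g<name> refers to a named group.
reordered = date_pattern.sub(r'\g<day>/\g<month>/\g<year>', text)
print(reordered)  # Python 3.9 was released on 05/10/2020
```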

Non-Capturing Groups

Sometimes you need grouping without capturing the matched content:

# Non-capturing group
results = re.findall(r'Python (?:\d\.\d+)', text)
print(results)  # ['Python 3.9']
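The choice between capturing and non-capturing groups matters most with re.findall, which returns the group contents instead of the whole match when the pattern has a capturing group. A minimal comparison follows; the filename list is a made-up sample.

```python
import re

text = "Python 3.9 was released on 2020-10-05"

# With a capturing group, findall returns only the group's content.
with_capture = re.findall(r'Python (\d\.\d+)', text)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r'Python (?:\d\.\d+)', text)

# Non-capturing groups are also handy for scoping an alternation.
images = re.findall(r'\w+\.(?:png|jpe?g)', "a.png b.jpg c.jpeg d.txt")

print(with_capture)     # ['3.9']
print(without_capture)  # ['Python 3.9']
print(images)           # ['a.png', 'b.jpg', 'c.jpeg']
```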

Backreferences

You can refer back to a previously captured group within the pattern:

# Find a repeated word
text = "The the quick brown fox"
match = re.search(r'\b(\w+)\s+\1\b', text, re.IGNORECASE)
if match:
    print(f"Repeated word: {match.group(1)}")
    # Repeated word: The
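Backreferences work in replacement strings as well. As a sketch, the snippet below collapses runs of repeated words into a single occurrence; the sample sentence is invented for the example.

```python
import re

text = "The the quick quick brown fox"

# (\w+) captures a word; (?:\s+\1\b)+ matches one or more repeats of it.
# The replacement \1 keeps only the first occurrence.
deduplicated = re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text, flags=re.IGNORECASE)

print(deduplicated)  # The quick brown fox
```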

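Greedy and Non-Greedy Matching

By default, quantifiers are greedy: they match as many characters as possible.

Greedy Matching

import re

text = "<div>Content 1</div><div>Content 2</div>"

# Greedy match
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group())  # <div>Content 1</div><div>Content 2</div>

Non-Greedy Matching

Adding ? after a quantifier makes it non-greedy, so it matches as few characters as possible:

# Non-greedy match
non_greedy = re.search(r'<div>.*?</div>', text)
print(non_greedy.group())  # <div>Content 1</div>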
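To pull out every element instead of just the first, the non-greedy pattern can be combined with re.findall. The sketch below also shows a common alternative, a negated character set, which works here under the assumption that the content contains no nested tags.

```python
import re

text = "<div>Content 1</div><div>Content 2</div>"

# Non-greedy: stop at the first closing tag after each opening tag.
lazy_contents = re.findall(r'<div>(.*?)</div>', text)

# Negated set: [^<]* cannot run past the next '<', so no laziness is needed.
negated_contents = re.findall(r'<div>([^<]*)</div>', text)

print(lazy_contents)     # ['Content 1', 'Content 2']
print(negated_contents)  # ['Content 1', 'Content 2']
```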

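Lookahead and Lookbehind

Lookahead and lookbehind assertions let you match text based on what follows or precedes it, without including that context in the match.

Positive Lookahead

import re

text = "Python is great, javascript is also great"

# Match words immediately followed by " is great"
results = re.findall(r'\w+(?= is great)', text)
print(results)  # ['Python']

Negative Lookahead

# Match whole words not immediately followed by " is great"
# (the \b anchors stop the engine from backtracking to a partial word such as 'Pytho')
results = re.findall(r'\b\w+\b(?! is great)', text)
print(results)  # ['is', 'great', 'javascript', 'is', 'also', 'great']

Positive and Negative Lookbehind

Note: Python's re module supports lookbehind assertions, but the pattern inside a lookbehind must match strings of a fixed length.

# Match words preceded by "Python "
text = "Python is great, Python was created by Guido"
results = re.findall(r'(?<=Python )\w+', text)
print(results)  # ['is', 'was']

# Match whole words not preceded by "Python "
# (the leading \b prevents a match from starting in the middle of a word)
results = re.findall(r'\b(?<!Python )\w+', text)
print(results)  # ['Python', 'great', 'Python', 'created', 'by', 'Guido']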
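Because lookarounds consume no characters, they can match an empty position between characters. As an illustrative sketch, the first snippet inserts thousands separators, and the second shows the fixed-width restriction on lookbehind in practice. The variable names are made up for this example.

```python
import re

# Match each empty position that has a digit before it and a multiple
# of three digits between it and the end of the string.
formatted = re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', '1234567')
print(formatted)  # 1,234,567

# A variable-width lookbehind such as (?<=a+) is rejected at compile time.
try:
    re.compile(r'(?<=a+)b')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(lookbehind_error is not None)  # True
```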

Flags and Options

Flags modify how a regular expression behaves.

Common Flags

import re

text = """Python is case-sensitive.
PYTHON is uppercase.
python is lowercase."""

# Ignore case
results = re.findall(r'python', text, re.IGNORECASE)
print(results)  # ['Python', 'PYTHON', 'python']

# Multiline mode: ^ and $ match at the start and end of each line
results = re.findall(r'^python', text, re.MULTILINE)
print(results)  # ['python']

# Make the dot match every character, including newlines
text = "Python spans\nmultiple lines"
results = re.findall(r'spans.multiple', text, re.DOTALL)
print(results)  # ['spans\nmultiple']

# Verbose mode, which allows comments and whitespace in the pattern
pattern = re.compile(r'''
    Python      # match "Python"
    \s+         # one or more whitespace characters
    \d\.\d+     # version number, such as 3.9
    ''', re.VERBOSE)
match = pattern.search("Python 3.9 is great")
print(match.group())  # Python 3.9
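Flags can be combined. The sketch below shows two equivalent ways of doing so: joining flag constants with the | operator, and writing inline flags at the start of the pattern.

```python
import re

text = """Python is case-sensitive.
PYTHON is uppercase.
python is lowercase."""

# Combine flags with the bitwise OR operator.
combined = re.findall(r'^python', text, re.IGNORECASE | re.MULTILINE)

# Inline flags: (?im) turns on IGNORECASE and MULTILINE for the whole pattern.
inline = re.findall(r'(?im)^python', text)

print(combined)  # ['Python', 'PYTHON', 'python']
print(inline)    # ['Python', 'PYTHON', 'python']
```

Inline flags are convenient when the pattern is stored as a plain string, for example in a configuration file, and the calling code cannot pass a flags argument.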

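Compiling Regular Expressions

If you use the same regular expression many times, you can compile it once and reuse the resulting pattern object. The re module already caches recently used patterns, so the speedup is usually modest, but a compiled pattern avoids repeated cache lookups and gives the pattern a reusable name.

import re

# Compile the regular expression
# (inside a character set, | is a literal pipe, so the letters are written as [A-Za-z])
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

# Use the compiled pattern
text = "Contact us at info@example.com or support@company.org"
emails = email_pattern.findall(text)
print(emails)  # ['info@example.com', 'support@company.org']

# Check whether the whole string is an email address
is_valid = bool(email_pattern.fullmatch("user@domain.com"))
print(is_valid)  # True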

Practical Examples

Example 1: Extracting URLs

import re

text = """
Visit our website at https://www.example.com.
For support, go to http://help.example.com/support.
Our API documentation is available at https://api.example.com/v2/docs.
"""

# Extract URLs
url_pattern = re.compile(r'https?://[^\s]+')
urls = url_pattern.findall(text)
print(urls)
# ['https://www.example.com.', 'http://help.example.com/support.', 'https://api.example.com/v2/docs.']

# Clean the URLs (strip trailing punctuation)
clean_urls = [re.sub(r'[.,]$', '', url) for url in urls]
print(clean_urls)
# ['https://www.example.com', 'http://help.example.com/support', 'https://api.example.com/v2/docs']

Example 2: Parsing Log Files

import re

log_lines = [
    "2023-05-15 10:23:45 INFO User login successful: user123",
    "2023-05-15 10:24:12 ERROR Database connection failed",
    "2023-05-15 10:25:30 WARNING Disk space low: 15% remaining",
    "2023-05-15 10:26:45 INFO User logout: user123"
]

# Parse the log lines
log_pattern = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.+)')

for line in log_lines:
    match = log_pattern.match(line)
    if match:
        timestamp, level, message = match.groups()
        print(f"Time: {timestamp}, Level: {level}, Message: {message}")

# Extract all error messages (each line is matched once; := needs Python 3.8+)
error_messages = [
    match.group(3)
    for line in log_lines
    if (match := log_pattern.match(line)) and match.group(2) == "ERROR"
]
print("Error messages:", error_messages)
# Error messages: ['Database connection failed']
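The same log format can be parsed with named groups, which makes follow-up processing easier to read. Here is a sketch that counts entries per log level; the group names (timestamp, level, message) and the use of collections.Counter are choices made for this illustration, not part of the original example.

```python
import re
from collections import Counter

log_lines = [
    "2023-05-15 10:23:45 INFO User login successful: user123",
    "2023-05-15 10:24:12 ERROR Database connection failed",
    "2023-05-15 10:25:30 WARNING Disk space low: 15% remaining",
    "2023-05-15 10:26:45 INFO User logout: user123",
]

# Adjacent string literals are concatenated into one pattern.
named_log_pattern = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<level>\w+) '
    r'(?P<message>.+)'
)

level_counts = Counter()
for line in log_lines:
    match = named_log_pattern.match(line)
    if match:  # skip lines that do not follow the expected format
        level_counts[match.group('level')] += 1

print(level_counts['INFO'], level_counts['ERROR'], level_counts['WARNING'])  # 2 1 1
```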

Example 3: Validating and Cleaning User Input

import re

def validate_username(username):
    """Validate a username: letters, digits, and underscores only, 4 to 20 characters."""
    # fullmatch avoids the pitfall that $ also matches just before a trailing newline
    pattern = re.compile(r'[a-zA-Z0-9_]{4,20}')
    return bool(pattern.fullmatch(username))

def validate_email(email):
    """Validate an email address (a simplified check, not the full email grammar)."""
    # fullmatch rejects input with extra text after the address
    pattern = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
    return bool(pattern.fullmatch(email))

def sanitize_html(text):
    """Strip HTML tags (fine for simple markup, not a safe sanitizer for untrusted HTML)."""
    pattern = re.compile(r'<[^>]+>')
    return pattern.sub('', text)

# Test the validation functions
usernames = ["user123", "user@123", "ab", "validusername_123"]
for username in usernames:
    print(f"{username}: {'Valid' if validate_username(username) else 'Invalid'}")
# user123: Valid
# user@123: Invalid
# ab: Invalid
# validusername_123: Valid

# Test HTML cleaning
html = "<p>This is <strong>important</strong> text.</p>"
clean_text = sanitize_html(html)
print(clean_text)  # This is important text.

Example 4: Extracting Structured Data from Text

import re

# Extract product information
product_text = """
Product: iPhone 13
Price: $799.00
SKU: APPL-IPH-13-128
Available: Yes

Product: Samsung Galaxy S21
Price: $699.99
SKU: SMSNG-GS21-256
Available: No
"""

# Use named groups to extract product information
product_pattern = re.compile(r'''
    Product:\s+(?P<name>[\w\s]+)\n
    Price:\s+\$(?P<price>[\d.]+)\n
    SKU:\s+(?P<sku>[\w-]+)\n
    Available:\s+(?P<available>Yes|No)
''', re.VERBOSE)

products = []
for match in product_pattern.finditer(product_text):
    product = match.groupdict()
    product['price'] = float(product['price'])
    product['available'] = product['available'] == 'Yes'
    products.append(product)

for product in products:
    print(f"Name: {product['name'].strip()}")
    print(f"Price: ${product['price']}")
    print(f"SKU: {product['sku']}")
    print(f"Available: {'Yes' if product['available'] else 'No'}")
    print()

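Performance Considerations

Regular expressions are powerful, but careless use can cause performance problems.

Avoiding Catastrophic Backtracking

Some patterns can trigger catastrophic backtracking, where the running time grows exponentially with the length of the input:

import re
import time

# Nested quantifiers like this can cause catastrophic backtracking
bad_pattern = re.compile(r'(a+)+b')

# A string that can never match because it contains no 'b'.
# Keep it short: each extra 'a' roughly doubles the running time.
text = 'a' * 22

start_time = time.perf_counter()
# Note: the re module has no timeout option, so this call runs to completion
result = bad_pattern.match(text)
print(f"Match result: {result}")  # Match result: None
print(f"Elapsed time: {time.perf_counter() - start_time:.6f} seconds")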
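One way to avoid the problem is to remove the nested quantifier. The pattern (a+)+b accepts exactly the same strings as a+b, so the rewrite below is a drop-in replacement for this particular case. It is a sketch for this specific pattern, not a general recipe.

```python
import re

# Far longer than the nested pattern could handle in reasonable time
text = 'a' * 10_000

# Same language as (a+)+b, but with a single quantifier
safe_pattern = re.compile(r'a+b')

no_match = safe_pattern.match(text)         # fails quickly: there is no 'b'
has_match = safe_pattern.match(text + 'b')  # matches the whole string

print(no_match)         # None
print(has_match.end())  # 10001
```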

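Optimization Tips

Use non-capturing groups: when you do not need to refer to a group's content, use (?:...) instead of (...).
Avoid overusing greedy quantifiers, especially on large texts.
Be as specific as possible: more specific patterns reduce backtracking.
Precompile regular expressions: when a pattern is reused, compiling it once avoids repeated cache lookups.
Consider alternatives: for simple string operations, the built-in string methods may be faster and clearer.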
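As an example of the last tip, here are three simple checks written with regular expressions and then with built-in string methods. The filename is a made-up sample.

```python
import re

filename = "report_2023.csv"

# Regular expression versions
regex_is_csv = re.search(r'\.csv$', filename) is not None
regex_has_report = re.search(r'report', filename) is not None
regex_renamed = re.sub(r'_', '-', filename)

# Equivalent built-in string methods
str_is_csv = filename.endswith('.csv')
str_has_report = 'report' in filename
str_renamed = filename.replace('_', '-')

print(str_is_csv, str_has_report, str_renamed)  # True True report-2023.csv
```

The string methods also avoid escaping mistakes, since characters such as the dot have no special meaning in them.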

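Debugging Regular Expressions

Debugging a complex regular expression can be difficult. Here are some useful techniques:

Using the re.DEBUG Flag

import re

# Compiling with re.DEBUG prints the parsed structure of the pattern
pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', re.DEBUG)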

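Using Online Tools

Many online tools, such as regex101.com, can help you visualize and test regular expressions.

Building Step by Step

For a complex regular expression, start with a simple part, then build and test it incrementally: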

import re

# Step 1: match the username part
username_pattern = r'[A-Za-z0-9._%+-]+'
test_string = "user.name_123"
print(re.match(username_pattern, test_string).group())  # user.name_123

# Step 2: add the @ sign
username_domain_pattern = username_pattern + r'@'
test_string = "user.name_123@"
print(re.match(username_domain_pattern, test_string).group())  # user.name_123@

# Step 3: add the domain part
email_pattern = username_domain_pattern + r'[A-Za-z0-9.-]+'
test_string = "user.name_123@example"
print(re.match(email_pattern, test_string).group())  # user.name_123@example

# Step 4: add the top-level domain ([A-Za-z], since | is literal inside a set)
full_email_pattern = email_pattern + r'\.[A-Za-z]{2,}'
test_string = "user.name_123@example.com"
print(re.match(full_email_pattern, test_string).group())  # user.name_123@example.com
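Once the pattern is assembled, a small table of inputs with expected outcomes makes it easy to re-check the pattern after every change. The sketch below uses invented test cases and re.fullmatch, so that trailing text cannot slip through.

```python
import re

full_email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# Hypothetical test cases: input string -> whether it should be accepted
cases = {
    "user.name_123@example.com": True,
    "user@localhost": False,   # no top-level domain
    "@example.com": False,     # empty username
    "user@example.c": False,   # top-level domain too short
}

results = {s: re.fullmatch(full_email_pattern, s) is not None for s in cases}

for s, expected in cases.items():
    status = "ok" if results[s] == expected else "MISMATCH"
    print(f"{s!r}: {results[s]} ({status})")
```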

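Conclusion

Regular expressions are a powerful tool for processing text, and mastering them can greatly improve your text-processing efficiency. Python's re module offers comprehensive support, from basic pattern matching to complex text extraction and validation.

Regular expressions can look cryptic at times, but with practice and a grasp of the basic concepts you can gradually master them. Remember that writing clear, efficient regular expressions is a balancing act between expressiveness, readability, and performance.

Do you have questions about regular expressions or tips to share? Join the discussion in the comments!

title: Mastering Python Regular Expressions: From Basics to Advanced Applications
date: 2023-05-20 14:30:00
categories: python
tags: [regular expressions, text processing, pattern matching, re module, strings]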

