This article covers how to use regular expressions in Python to process HTML files. It works through a practical exercise step by step; the approach is simple and should help with similar problems.
Using regular expressions in Python to process HTML files
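The finditer method finds every match in a string. You may already have used findall, which returns a list of the matched strings. finditer instead returns an iterator that yields a match object for each match in turn. In the code below, these match objects are visited in a for loop so that group 1 of each can be printed.
Your task is to write Python REs that recognise certain patterns in an HTML text file. Add code to the STARTER script that compiles REs for these patterns (assigning them to meaningfully named variables) and applies them to each line of the file, printing any matches found.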
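The difference between findall and finditer can be seen in a minimal sketch. The sample line is made up for illustration and is not taken from RGX_DATA.html; the pattern is the test RE used in the script.

```python
import re

# Hypothetical sample line (an assumption, not from the data file).
line = "Prolog systems: SICStus, logic programming, Logic"

pattern = re.compile(r"(logic|sicstus)", re.I)

# findall returns a plain list of strings; with exactly one group
# in the pattern, each item is the text matched by that group.
strings = pattern.findall(line)

# finditer yields match objects, which also carry position information.
matches = [(m.group(1), m.start()) for m in pattern.finditer(line)]

print(strings)   # ['SICStus', 'logic', 'Logic']
for text, start in matches:
    print("matched", text, "at offset", start)
```

Match objects are the better choice when you need more than the matched text, for instance several groups or the match position.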
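1. Write a pattern that recognises HTML tags, and print each one as "TAG: TAG string" (for example, "TAG: b" for the tag <b>). For simplicity, assume that the open and close angle brackets (<, >) of each tag always appear on the same line of text. A first attempt might be the regex "<.*>", where "." is the predefined character class symbol that matches any character. Try it out and work out why it is not a good solution. Then write a better solution that fixes the problem.
2. Modify the code so that it distinguishes opening and closing tags (e.g. p vs. /p), printing OPENTAG and CLOSETAG.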
import re

#------------------------------

testRE = re.compile(r'(logic|sicstus)', re.I)
# Any tag: '<', optional '/', a letter, then anything up to the next '>'
testI = re.compile(r'</?[A-Za-z][^>]*>')
# Opening tag: the first character after '<' must not be '/'
testO = re.compile(r'<[^/](\S*?)[^>]*>')
# Closing tag: '<' immediately followed by '/'
testC = re.compile(r'</(\S*?)[^>]*>')

with open('RGX_DATA.html') as infs:
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print(' ', '-' * 100, '[%d]' % linenum, '\n TEXT:', line, end='')

        # finditer yields one match object per match on the line
        for m in testRE.finditer(line):
            print('** TEST-RE:', m.group(1))

        # Task 1: every tag, printed without its angle brackets
        for i in testI.finditer(line):
            print('Tag:', i.group().replace('<', '').replace('>', ''))

        # Task 2: opening and closing tags reported separately
        for m in testO.finditer(line):
            print('opening:', m.group().replace('<', '').replace('>', ''))
        for n in testC.finditer(line):
            print('closing:', n.group().replace('<', '').replace('>', ''))
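The weakness of "<.*>" raised in task 1 comes from greedy matching: ".*" consumes as much as it can, so the match runs from the first "<" on the line to the last ">". A small sketch, using a made-up sample line, compares it with two common alternatives:

```python
import re

# Hypothetical sample line (an assumption, not from the data file).
line = "<p><b>bold</b> text</p>"

# Greedy: '.*' runs to the LAST '>' on the line.
greedy = re.findall(r"<.*>", line)

# Non-greedy: '.*?' stops at the FIRST '>' it can.
lazy = re.findall(r"<.*?>", line)

# Negated character class: '>' can never be part of the tag body.
negated = re.findall(r"<[^>]+>", line)

print(greedy)   # ['<p><b>bold</b> text</p>']
print(lazy)     # ['<p>', '<b>', '</b>', '</p>']
print(negated)  # ['<p>', '<b>', '</b>', '</p>']
```

The non-greedy and negated-class versions give the same result here; the negated class states the intent more explicitly.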
Note that some HTML tags have parameters, for example:
<table border=1 cellspacing=0 cellpadding=8>
Make sure that your open-tag pattern works for tags both with and without parameters, i.e. that it successfully finds and prints the tag label. Now extend your code so that it prints both the open tag label and its parameters, for example:
OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8
# Goes inside the 'for line in infs' loop, in place of the
# opening-tag loop
for m in testO.finditer(line):
    # Drop the angle brackets, then split on whitespace: the first
    # item is the tag label, the remaining items are parameters
    parts = m.group().replace('<', '').replace('>', '').split()
    for num, part in enumerate(parts):
        if num == 0:
            print('opening:', part)
        else:
            print('param:', part)
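As a self-contained illustration of the same idea, the sketch below captures the tag label and the parameter string in separate groups instead of stripping brackets by hand. The helper name describe_open_tags is hypothetical, and the sketch assumes parameters contain no whitespace of their own.

```python
import re

# Group 1: tag label. Group 2: everything else before '>'.
# A close tag cannot match because '/' is not a word character.
open_tag = re.compile(r"<(\w+)([^>]*)>")


def describe_open_tags(line):
    """Return a list of (label, [params]) for each open tag in line."""
    results = []
    for m in open_tag.finditer(line):
        # Assumes parameters are whitespace-separated tokens.
        results.append((m.group(1), m.group(2).split()))
    return results


line = "<table border=1 cellspacing=0 cellpadding=8>"
for label, params in describe_open_tags(line):
    print("OPENTAG:", label)
    for param in params:
        print("PARAM:", param)
```

Quoted parameter values that contain spaces would be split incorrectly by this approach, so it is only suitable for simple input like the example line.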
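In a regular expression, a backreference can be used to require that the substring matched by an earlier part of the regex appears again. It has the form \N (where N is a positive integer) and refers back to the text matched by the Nth group of the regex. For example, a regex such as r"(\w+) \1" matches only if the exact string matched by the group (\w+) appears again at the position of the backreference \1. It could match a string in which "the" appears twice in a row, such as "the the". Using a backreference, write a pattern that matches when a line contains a paired open and close tag, for example the <b> and </b> around text in bold.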
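A short sketch of both uses of a backreference follows. The sample strings are invented for illustration, and the tag-pair pattern is one possible formulation rather than the only one.

```python
import re

# A word followed by a space and the same word again.
repeated_word = re.compile(r"\b(\w+) \1\b")

# An open tag, some text, and a close tag with the SAME label.
# '\b' after '\w+' stops the label from being cut short, and
# '.*?' keeps the enclosed text as short as possible.
tag_pair = re.compile(r"<(\w+\b)[^>]*>(.*?)</\1\s*>")

m = repeated_word.search("he kicked the the ball")
doubled = m.group(1)

pairs = [(p.group(1), p.group(2))
         for p in tag_pair.finditer("<b>bold</b> and <i>italic</i>")]

# Labels differ, so the backreference prevents a match.
mismatch = tag_pair.search("<b>not closed properly</i>")

print(doubled)    # the
print(pairs)      # [('b', 'bold'), ('i', 'italic')]
print(mismatch)   # None
```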
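Consider that we might want to create a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain text file from which all HTML markup has been removed. We will not do that here, but will instead consider a simpler case: deleting any HTML tags found in each line of the input data file.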
You should be able to do this with the RE you have already defined for recognising HTML tags. Print the resulting text to the screen, prefixed with "STRIPPED:".
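Stripping comes down to substituting the empty string for every tag match. A minimal sketch, with a made-up sample line and a simple tag pattern chosen for illustration:

```python
import re

# Matches open and close tags, with or without parameters.
tag = re.compile(r"</?\w+[^>]*>")

# Hypothetical sample line (an assumption, not from the data file).
line = '<p>Some <b>bold</b> text with a <a href="x.html">link</a>.</p>'

# sub replaces every non-overlapping match unless 'count' limits it.
stripped = tag.sub("", line)

print("STRIPPED:", stripped)   # STRIPPED: Some bold text with a link.
```

This removes tags only; character entities such as &amp;amp; and tags that span several lines are left untouched.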
import re

#------------------------------
# PART 1:
# Key thing is to avoid matching strings that include
# multiple tags, e.g. treating '<p><b>' as a single
# tag. Can do this in several ways. Firstly, use
# non-greedy matching, so get shortest possible match
# including the two angle brackets:

tag = re.compile('</?(.*?)>')

# The above treats the '/' of a close tag as a separate
# optional component - so that this doesn't turn up as
# part of the match '.group(1)', which is meant to return
# the tag label.
# Following alternative solution uses a negated character
# class to explicitly prevent this including '>':

tag = re.compile('</?([^>]+)>')

# Finally, following version separates finding the tag
# label string from any (optional) parameters that might
# also appear before the close angle bracket:

tag = re.compile(r'</?(\w+\b)([^>]+)?>')

# Note that use of '\b' (as word boundary anchor) here means
# we must mark the regex string as a 'raw' string (r'..').

#------------------------------
# PART 2:
# Following closeTag definition requires first char
# after the open angle bracket to be '/', while openTag
# definition excludes this by requiring first char to be
# a 'word char' (\w):

openTag = re.compile(r'<(\w[^>]*)>')
closeTag = re.compile(r'</([^>]*)>')

# Following revised definitions are more carefully stated
# for correct extraction of tag label (separately from
# any parameters):

openTag = re.compile(r'<(\w+\b)([^>]+)?>')
closeTag = re.compile(r'</(\w+\b)\s*>')

#------------------------------
# PART 3:
# Above openTag definition will already get the string
# encompassing any parameters, and return it as
# m.group(2), i.e. defn:

openTag = re.compile(r'<(\w+\b)([^>]+)?>')

# If assume that parameters are continuous non-whitespace
# chars separated by whitespace chars, then we can divide
# them up using split - and that's how we handle them
# here. (In reality, parameter strings can be a lot more
# messy than this, but we won't try to deal with that.)

#------------------------------
# PART 4:

openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1\s*>')

# Note use of non-greedy matching for the text falling
# *between* the open/close tag pair - to avoid false
# results where have two similar tag pairs on same line.

#------------------------------
# PART 5: URLS
# This is quite tricky. The URL expressions in the file
# are of two kinds, of which the first is a string
# between double quotes ("..") which may include
# whitespace. For this case we might have a regex:

url = re.compile('href=("[^">]+")', re.I)

# The second case does not have quotes, and does not
# allow whitespace, consisting of a continuous sequence
# of non-whitespace material (that ends when you reach a
# space or close bracket '>'). This might be:

url = re.compile(r'href=([^">\s]+)', re.I)

# We can combine these two cases as follows, and still
# get the expression back as group(1):

url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

# Note that I've done nothing here to exclude 'mailto:'
# links as being accepted as URLS.

#------------------------------

with open('RGX_DATA.html') as infs:
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print(' ', '-' * 100, '[%d]' % linenum, '\n TEXT:', line, end='')

        # PART 1: find HTML tags
        # (The following uses 'finditer' to find ALL matches
        # within the line)

        mm = tag.finditer(line)
        for m in mm:
            print('** TAG:', m.group(1), ' + [%s]' % m.group(2))

        # PART 2,3: find open/close tags (+ params of open tags)

        mm = openTag.finditer(line)
        for m in mm:
            print('** OPENTAG:', m.group(1))
            if m.group(2):
                for param in m.group(2).split():
                    print(' PARAM:', param)

        mm = closeTag.finditer(line)
        for m in mm:
            print('** CLOSETAG:', m.group(1))

        # PART 4: find open/close tag pairs appearing on same line

        mm = openCloseTagPair.finditer(line)
        for m in mm:
            print("** PAIR [%s]: \"%s\"" % (m.group(1), m.group(3)))

        # PART 5: find URLs:

        mm = url.finditer(line)
        for m in mm:
            print('** URL:', m.group(1))

        # PART 6: Strip out HTML tags (note that .sub will do all
        # possible substitutions, unless number is limited by count
        # keyword arg - which is fortunately what we want here)

        stripped = tag.sub('', line)
        print('** STRIPPED:', stripped, end='')
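The combined URL pattern from PART 5 can be tried on its own without the data file. In this sketch the sample line is invented for illustration; the regex is the one defined in the script.

```python
import re

# Quoted value (may contain spaces) OR an unquoted run of
# characters that ends at whitespace, a quote or '>'.
url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

# Hypothetical sample line (an assumption, not from the data file).
line = '<a href="my page.html">one</a> <A HREF=index.html>two</A>'

urls = [m.group(1) for m in url.finditer(line)]

print(urls)   # ['"my page.html"', 'index.html']
```

The quoted form keeps its surrounding double quotes in group(1); calling .strip('"') on the result removes them if only the bare URL is wanted.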