Polish characters when parsing data from opensubtitles

import fileinput
from os import listdir
import sys
import glob
import xmltodict
import _json

years = range(0, 2000)
#years = range(1984, 1985)

# can't even deal right now
# 1951 deleted
# 1953 deleted

for xyear in years:

	year = './xml/pl/%d' % xyear

	for movieDir in glob.glob(year + '/*' * 1):
		movieFiles = listdir(movieDir)

		script = movieFiles[0]
		if not script.endswith(".xml"):
			continue

		# Read the whole file as text (str)
		with open(movieDir+'/'+script, encoding='utf8') as f:
			text = f.read()

		print(movieDir+'/'+script)

		print('bytes '+text)

		from lxml import etree
		root = etree.fromstring(text)
		result = ""
		tmp = []
		for x in root.xpath('//document/s/w'):
			tmp.append(x.text)
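		# Join the tokens, adding a space before each token that starts a new word
		for i, word in enumerate(tmp):
			if not word:
				continue
			# isalnum() also covers Polish letters such as 'ś', unlike the ASCII range 'A'..'z'
			if i > 0 and word[0].isalnum():
				result += ' '
			result += word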
		#print result
		# Open in text mode with an explicit encoding and write the str directly
		with open(movieDir+'/subtitle.txt', 'w', encoding='utf-8') as g:
			g.write(result)
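
The token-joining step can be pulled out into a small helper that is easy to test on its own. This is only a minimal sketch: `join_tokens` is a hypothetical name, and it assumes that "starts with a letter or digit" is the intended rule for inserting a space. It uses `str.isalnum()`, which also returns True for Polish letters such as `ś`.

```python
def join_tokens(tokens):
    """Join subtitle tokens, putting a space before tokens that start with a letter or digit."""
    parts = []
    for token in tokens:
        if not token:  # skip None or empty <w> elements
            continue
        # No space before the very first token or before punctuation
        if parts and token[0].isalnum():
            parts.append(' ')
        parts.append(token)
    return ''.join(parts)


print(join_tokens(['Na', 'klifie', 'Ningbi', 'jest', 'świątynia', '.']))
```

Punctuation tokens are attached directly to the preceding word, and every token, including the last one, ends up in the result.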

The XML file looks like this:

<?xml version="1.0" encoding="utf-8"?>
<document>
  <s id="1">
    <time id="T1S" value="00:01:25,772" />
    <w id="1.1">Na</w>
    <w id="1.2">klifie</w>
    <w id="1.3">Ningbi</w>
    <w id="1.4">jest</w>
    <w id="1.5">świątynia</w>
    <time id="T1E" value="00:01:28,223" />
  </s>
...

Hi, I wanted to convert data downloaded from opensubtitles into a txt/JSON format containing only the subtitle text, but I ran into this error:

Traceback (most recent call last):
  File "C:\Users\Dom\anaconda3\lib\site-packages\xmltodict.py", line 5, in <module>
    from defusedexpat import pyexpat as expat
ModuleNotFoundError: No module named 'defusedexpat'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Dom/Desktop/OpenSubtitles-master/xml.py", line 7, in <module>
    import xmltodict
  File "C:\Users\Dom\anaconda3\lib\site-packages\xmltodict.py", line 7, in <module>
    from xml.parsers import expat
  File "C:\Users\Dom\Desktop\OpenSubtitles-master\xml.py", line 38, in <module>
    root = etree.fromstring(text)
  File "src/lxml/etree.pyx", line 3235, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1871, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Can anyone tell me what I should fix?

The solution was to replace

		root = etree.fromstring(text)

with

		# Read the file as raw bytes so lxml can apply the encoding from the XML declaration itself
		with open(movieDir + '/' + script, "rb") as data:
			xml_bytes = data.read()
		root = etree.XML(xml_bytes)
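
The fix works because the parser now receives bytes and decodes them itself according to the `encoding="utf-8"` declaration. As the ValueError says, lxml rejects an already-decoded str that still carries such a declaration. Below is a minimal sketch of the same idea. To stay free of third-party packages it uses the standard library's `xml.etree.ElementTree` instead of lxml, embeds a sample modelled on the XML above instead of reading from disk, and uses `extract_words` as a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the XML from the question, held as UTF-8 bytes
# (the same thing open(path, "rb").read() would return).
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<document>\n'
    '  <s id="1">\n'
    '    <time id="T1S" value="00:01:25,772" />\n'
    '    <w id="1.1">Na</w>\n'
    '    <w id="1.2">klifie</w>\n'
    '    <w id="1.3">Ningbi</w>\n'
    '    <w id="1.4">jest</w>\n'
    '    <w id="1.5">świątynia</w>\n'
    '    <time id="T1E" value="00:01:28,223" />\n'
    '  </s>\n'
    '</document>\n'
).encode('utf-8')


def extract_words(xml_bytes):
    """Parse raw XML bytes and return the text of every non-empty <w> element."""
    # Bytes in: the parser applies the encoding named in the XML declaration.
    root = ET.fromstring(xml_bytes)
    return [w.text for w in root.iter('w') if w.text]


words = extract_words(SAMPLE)
print(words)
```

With lxml, the equivalent calls are `etree.fromstring(xml_bytes)` or `etree.XML(xml_bytes)`. If the content is already a str, calling `text.encode('utf-8')` before parsing should have the same effect. Separately, the traceback shows the script is saved as `xml.py`, so `from xml.parsers import expat` inside xmltodict appears to import the script itself rather than the standard library's `xml` package. Renaming the script would likely avoid that.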
