processor.py
# coding: utf-8
# In[2]:
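import nltk
import re
import pprint
import urllib as url
try:
    import urllib2 as url2  # Python 2
except ImportError:
    import urllib.request as url2  # closest Python 3 counterpart of urllib2
from bs4 import BeautifulSoup
import xml.etree.ElementTree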
# In[3]:
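def getroot_xml(filename='Posts_small.xml'):
    '''
    This function is used to get the root of the xml tree.
    Input- path to the xml file (defaults to 'Posts_small.xml')
    Output- root of xml tree
    '''
    # Parse the file named by the argument and return the root element of its tree.
    root = xml.etree.ElementTree.parse(filename).getroot()
    return root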
# In[ ]:
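def get_questions(root):
    '''
    This function reads the text in questions.
    Input- root of xml tree
    Output- list of question strings with HTML markup and code snippets removed
    '''
    questions = []
    for row in root.findall('row'):
        body = row.get("Body")
        if body is None:
            # Skip rows that have no Body attribute.
            continue
        # Name the parser explicitly so the result does not depend on which parsers are installed.
        soup = BeautifulSoup(body, 'html.parser')
        # Drop <code> elements so that only the prose of the post remains.
        for s in soup('code'):
            s.extract()
        question = soup.get_text()
        print(question)
        questions.append(question)
    return questions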
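
# A minimal sketch of what get_questions does to each row, written for Python 3
# with only the standard library. It swaps in html.parser.HTMLParser for
# BeautifulSoup so that it runs without third-party packages; whitespace and
# malformed-HTML handling may therefore differ from the function above.
# The two sample rows and the names TextOutsideCode, strip_code, build_sample_root
# and sample_questions are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextOutsideCode(HTMLParser):
    """Collect text nodes, skipping anything nested inside <code> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <code> tags currently open
        self.chunks = []  # text seen outside <code>

    def handle_starttag(self, tag, attrs):
        if tag == 'code':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'code' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def strip_code(html):
    """Return the text of an HTML fragment with <code> contents removed."""
    parser = TextOutsideCode()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


def build_sample_root():
    """Build a made-up tree of <row> elements, one with a Body and one without."""
    root = ET.Element('posts')
    ET.SubElement(root, 'row', {
        'Id': '1',
        'Body': '<p>How do I sort a list?</p><code>xs.sort()</code>',
    })
    ET.SubElement(root, 'row', {'Id': '2'})  # no Body attribute
    return root


def sample_questions(root):
    """Mirror the loop in get_questions, using strip_code instead of BeautifulSoup."""
    questions = []
    for row in root.findall('row'):
        body = row.get('Body')
        if body is None:
            continue  # skip rows without a Body attribute
        questions.append(strip_code(body))
    return questions


if __name__ == '__main__':
    for question in sample_questions(build_sample_root()):
        print(question)
```

# The row without a Body attribute is skipped rather than passed to the parser,
# and the contents of the <code> element do not appear in the extracted text.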