scraped telegram posts having line breaks like the original post #687
mohammadali-seifkashani wants to merge 5 commits into JustAnotherArchivist:master
…r having line breaks in result
…elScraper._soup_to_items method
… - convert long lines to paragraphs
JustAnotherArchivist
left a comment
Please revert the indentation and empty line changes, then I can review the other changes.
JustAnotherArchivist
left a comment
The indentation is still wrong and the diff unreadable.
Please do me a favor and check the code.
JustAnotherArchivist
left a comment
Thanks for the cleanup. The indentation of the new function is still incorrect though as it uses spaces, not tabs, so the code is a syntax error currently. Other things below.
    - return cls._cli_construct(args, args.channel)
    + return cls._cli_construct(args, args.channel)
    \ No newline at end of file
Still an undesired whitespace change. There should be a LF at the end of a (text-ish) file.
(Looks like GitHub doesn't display this correctly on the PR page itself, only in the full diff: https://github.com/JustAnotherArchivist/snscrape/pull/687/files#diff-7f40c11448f92ed2f5d1764136d372d15faa3d4da0272813e88478c4d8870a09L203)
    result = []
    # Using the features of the BS4 module itself
    for s in post.stripped_strings:
        result.append(s)
    return '\n'.join(result)
This can be simplified to '\n'.join(post.stripped_strings), but it doesn't do the right thing anyway. It splits out links into separate lines, and it doesn't preserve multiple line breaks. A good test case for both is https://t.me/telegram/201. Looks like this might require explicitly replacing the <br> tags.
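The idea of explicitly replacing the <br> tags could be sketched roughly as follows. This is only an illustration, not the code from this PR: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without dependencies, and the names _MessageTextExtractor and message_html_to_text as well as the sample HTML are made up for the example. With BeautifulSoup the same effect could presumably be had by replacing each br element with a newline string (Tag.replace_with) before calling get_text().

```python
from html.parser import HTMLParser


class _MessageTextExtractor(HTMLParser):
    """Hypothetical helper: collects text and turns each <br> into a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for the self-closing form <br/> via the default
        # handle_startendtag implementation.
        if tag == 'br':
            self._parts.append('\n')

    def handle_data(self, data):
        # Text inside inline tags such as <a> is kept in place, so links
        # are not split out onto separate lines.
        self._parts.append(data)

    def text(self):
        return ''.join(self._parts)


def message_html_to_text(html):
    """Return the text of an HTML fragment with <br> rendered as line breaks."""
    parser = _MessageTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()


# Made-up sample: an inline link and two consecutive line breaks.
sample = ('First line<br>with a <a href="https://example.org/">link</a> inline'
          '<br><br>Second paragraph')
print(message_html_to_text(sample))
```

Unlike joining stripped_strings, this keeps the link text on the same line as its surrounding sentence and preserves the empty line produced by two consecutive <br> tags.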
    soup = bs4.BeautifulSoup(r.text, 'lxml')

    @staticmethod
    def get_post_text(post) -> str:
Should be message, not post, to avoid confusion with the variable in _soup_to_items. This should also not be public API. So _get_message_text(message).
Fix module site new one
When I tested snscrape on Telegram channels, I noticed that the scraped posts don't have line breaks, which caused problems for my text analysis. So I created the static method "get_post_text()" to keep the line breaks of the original post.
I also did a little cleanup in the file, as follows: