scraped telegram posts having line breaks like the original post by mohammadali-seifkashani · Pull Request #687 · JustAnotherArchivist/snscrape

mohammadali-seifkashani · 2023-01-25T06:04:44Z

When I tested snscrape on Telegram channels, I noticed that the scraped posts don't have line breaks. This caused problems for my text analysis, so I created the static method "get_post_text()" to preserve them.

I also did a little cleanup of the file, as follows:

Remove redundant parentheses in conditions
Convert long lines to paragraphs

JustAnotherArchivist

Please revert the indentation and empty line changes, then I can review the other changes.

JustAnotherArchivist

The indentation is still wrong and the diff unreadable.

mohammadali-seifkashani · 2023-02-12T20:14:57Z

Please do me a favor and check the code.

JustAnotherArchivist

Thanks for the cleanup. The indentation of the new function is still incorrect though as it uses spaces, not tabs, so the code is a syntax error currently. Other things below.
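The effect of mixing indentation styles can be reproduced by compiling a small snippet. This is only an illustrative sketch: the `Scraper` class, its methods, and the `compiles` helper are hypothetical and not taken from telegram.py. It shows a tab-indented class body followed by a method indented with four spaces, which Python rejects.

```python
def compiles(source: str) -> bool:
    """Return True if the source parses, False on any SyntaxError."""
    try:
        compile(source, '<example>', 'exec')
    except SyntaxError:  # IndentationError and TabError are subclasses
        return False
    return True


# Hypothetical class indented consistently with tabs
TABS_ONLY = (
    'class Scraper:\n'
    '\tdef a(self):\n'
    '\t\treturn 1\n'
    '\n'
    '\t@staticmethod\n'
    '\tdef b():\n'
    '\t\treturn 2\n'
)

# Same class, but the new method is indented with four spaces
MIXED = (
    'class Scraper:\n'
    '\tdef a(self):\n'
    '\t\treturn 1\n'
    '\n'
    '    @staticmethod\n'
    '    def b():\n'
    '        return 2\n'
)
```

Calling `compiles(TABS_ONLY)` succeeds while `compiles(MIXED)` fails, because the four-space indent does not match any enclosing tab-indented level.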

JustAnotherArchivist · 2023-02-13T00:27:29Z

snscrape/modules/telegram.py

-		return cls._cli_construct(args, args.channel)
+		return cls._cli_construct(args, args.channel)
\ No newline at end of file


Still an undesired whitespace change. There should be a LF at the end of a (text-ish) file.

(Looks like GitHub doesn't display this correctly on the PR page itself, only in the full diff: https://github.com/JustAnotherArchivist/snscrape/pull/687/files#diff-7f40c11448f92ed2f5d1764136d372d15faa3d4da0272813e88478c4d8870a09L203)
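A missing final line feed can be caught before committing with a small check. The helper names below are hypothetical and the sketch works on raw bytes; it is one possible way to test for the condition, not part of snscrape.

```python
def ends_with_lf(data: bytes) -> bool:
    """True if the content is empty or its last byte is a line feed."""
    return not data or data.endswith(b'\n')


def ensure_trailing_lf(data: bytes) -> bytes:
    """Append a single LF if non-empty content lacks one."""
    return data if ends_with_lf(data) else data + b'\n'
```

To apply it to a file, read the file in binary mode and pass the bytes to `ensure_trailing_lf` before writing them back.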

JustAnotherArchivist · 2023-02-13T00:30:10Z

snscrape/modules/telegram.py

+        result = []
+        # Using the features of the BS4 module itself
+        for s in post.stripped_strings:
+            result.append(s)
+        return '\n'.join(result)


This can be simplified to '\n'.join(post.stripped_strings), but it doesn't do the right thing anyway. It splits out links into separate lines, and it doesn't preserve multiple line breaks. A good test case for both is https://t.me/telegram/201. Looks like this might require explicitly replacing the <br> tags.

JustAnotherArchivist · 2023-02-13T00:32:16Z

snscrape/modules/telegram.py

 			soup = bs4.BeautifulSoup(r.text, 'lxml')

+    @staticmethod
+    def get_post_text(post) -> str:


Should be message, not post, to avoid confusion with the variable in _soup_to_items. This should also not be public API. So _get_message_text(message).

bendasgfyug · 2026-02-19T11:10:03Z

Fix module site new one

mohammadali-seifkashani added 3 commits January 25, 2023 09:13

adding static method get_post_text to class TelegramChannelScraper for having line breaks in result

6546279

remove extra line soup.get_text(separator="\n") from my TelegramChannelScraper._soup_to_items method

117cab7

reformat telegram.py file: remove redundant parentheses in conditions - convert long lines to paragraphs

70c1a1f

JustAnotherArchivist requested changes Jan 27, 2023

revert identations because of the request of the repository owner

ed3d520

mohammadali-seifkashani requested review from JustAnotherArchivist and removed request for JustAnotherArchivist February 4, 2023 15:29

JustAnotherArchivist requested changes Feb 4, 2023

Just copy the source code and add my function to it without any other change

41aeff1

mohammadali-seifkashani requested a review from JustAnotherArchivist February 5, 2023 08:15

JustAnotherArchivist requested changes Feb 13, 2023

JustAnotherArchivist added the enhancement and module:telegram labels Feb 13, 2023

mohammadali-seifkashani wants to merge 5 commits into JustAnotherArchivist:master from mohammadali-seifkashani:master

3 participants
