Remove (or prevent) duplicate media files #37

@sjacks26

Description

Right now, getmedia.py downloads every media file it can. But because retweets and reshared images (and other media) often point to the same content, we might not need to keep every copy.
Instead, we might want to assign an identifier to each unique media file, then link that identifier to every tweet in which the media appears.

If that's the case, we might hash each media file so that byte-identical files produce the same digest; store the hashes of all unique files somewhere (perhaps alongside the main data collection, like we do with stream limit messages); compare each new file's hash against the stored hashes; and retain only the files whose hash is not already present.
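
To make the idea concrete, here is a rough sketch of the hash-and-compare step. It is not based on how getmedia.py is actually structured: the function names, the JSON index layout, and the choice of SHA-256 are all assumptions for illustration.

```python
import hashlib
import json
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_index(index_path):
    """Load the hash index (hypothetical JSON file), or start an empty one."""
    if os.path.exists(index_path):
        with open(index_path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_index(index, index_path):
    """Write the hash index back to disk."""
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2)


def register_media(index, media_path, tweet_id):
    """Record a downloaded file under its hash and return that hash.

    The first file seen with a given hash is kept. Later files with the
    same hash are deleted, and only the tweet ID is added to the entry.
    """
    media_hash = file_sha256(media_path)
    entry = index.get(media_hash)
    if entry is None:
        # New content: keep the file and start its list of tweets.
        index[media_hash] = {"path": media_path, "tweet_ids": [tweet_id]}
    else:
        if tweet_id not in entry["tweet_ids"]:
            entry["tweet_ids"].append(tweet_id)
        # Duplicate content: drop the extra copy, unless it is the kept file.
        if os.path.abspath(media_path) != os.path.abspath(entry["path"]):
            os.remove(media_path)
    return media_hash
```

In this sketch, the media step would call `register_media` once per downloaded file and call `save_index` at the end of a run. Note that this only catches byte-identical files; the same image re-encoded or resized would hash differently and be kept as a separate file.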
