Skip to content

Conversation

@Pokechu22
Copy link
Contributor

Before, files with non-ascii names would crash when using --xmlrevisions (a regression from #451 by @yzqzss). For example, French Wikiversity has Fichier:Ensemble_access_croisé.jpg (I'm using that as an example here because it's a small site that I know will continue to exist).

$ python2 -b -u dumpgenerator.py --xmlrevisions --images --force --api https://fr.wikiversity.org/w/api.php
Checking API... https://fr.wikiversity.org/w/api.php
API is OK: https://fr.wikiversity.org/w/api.php
Checking index.php... https://fr.wikiversity.org/w/index.php
index.php is OK
PLEASE, DO NOT USE THIS SCRIPT TO DOWNLOAD WIKIMEDIA PROJECTS!
Download the dumps from http://dumps.wikimedia.org
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://fr.wikiversity.org/w/api.php
Trying generating a new dump into a new directory...
Retrieving image filenames
..    Found 84 images
84 image names loaded
Image filenames and URLs saved at... frwikiversityorg_w-20230624-images.txt
Retrieving images from "start"
Creating "./frwikiversityorg_w-20230624-wikidump/images" directory
    Downloaded 10 images
    Downloaded 20 images
    Downloaded 30 images
    Downloaded 40 images
/usr/lib/python2.7/urllib.py:1306: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "dumpgenerator.py", line 2572, in <module>
    main()
  File "dumpgenerator.py", line 2564, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 2147, in createNewDump
    session=other['session'])
  File "dumpgenerator.py", line 1524, in generateImageDump
    r = session.get(config['api'] + u"?action=query&export&exportnowrap&titles=%s" % urllib.quote(title))
  File "/usr/lib/python2.7/urllib.py", line 1306, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xe9'

Python 2's str vs unicode and the implicit conversions are a mess.

@nemobis
Copy link
Member

nemobis commented Jun 25, 2023

Ah yes, that error has been annoying. Thanks!

Do we really need to pass the URL as unicode? We have the titles as unicode because we're trying to convert at the I/O boundary (so that e.g. we compare titles in unicode only), but I'm not sure requests needs it.

@Pokechu22
Copy link
Contributor Author

config[api] is str, so probably the URL should also be str, good catch.

@Pokechu22 Pokechu22 force-pushed the url-quote-str-not-unicode branch from 799bce4 to 98626d2 Compare June 25, 2023 18:11
@Pokechu22
Copy link
Contributor Author

Anything holding up this being merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants