Conversation
…ss_file. This allows users to use the same args for processing data as librosa.load()
mbsantiago
left a comment
There was a problem hiding this comment.
Hi @kaviecos. Thanks so much for the PR. It looks pretty good and I left a few comments.
batdetect2/utils/detector_utils.py
Outdated
|
|
||
| _file_id = file_id | ||
| if _file_id is None: | ||
| _file_id = os.path.basename(path) if isinstance(path, str) else "unknown" |
There was a problem hiding this comment.
Would be better to provide a unique ID. One option is to use the uuid.uuid4 function of the standard library, another is to use the md5 or sha256 hash of the file contents. I would lean on using the md5 hash, as it uniquely depends on the file contents, meaning that the same id would be used if the same file is uploaded/processed again.
There was a problem hiding this comment.
I'm looking at an implementation of _generate_id(path) like this:
def _generate_id(path: Union[
str, int, os.PathLike[Any], sf.SoundFile, audioread.AudioFile, BinaryIO
]) -> str:
""" Generate an id based on the path.
If the path is a str or PathLike it will parsed as the basename.
This should ensure backwards compatibility with previous versions.
"""
if isinstance(path, str) or isinstance(path, os.PathLike):
return os.path.basename(path)
elif isinstance(path, BinaryIO):
hasher = hashlib.md5()
path.seek(0)
while chunk := path.read(8192):
hasher.update(chunk)
path.seek(0)
return hasher.hexdigest()
else:
return str(uuid.uuid4())
But I don't really like the md5 part, as there is some inconsitency in the behaviour. I'd prefer either; raise an exception if path is not string or PathLike (as we cannot get the basename in this case). This would make it clear to users that they have to provide an id in this case.
or return a uuidv4 - this, as you mention, has the downside of returning different ids on different runs of the same data.
There was a problem hiding this comment.
Also, I could add a function in api.py that takes a URL and calls du.process_file like this (file_id defaults to URL):
def process_url(
url: str,
model: DetectionModel = MODEL,
config: Optional[ProcessingConfiguration] = None,
device: torch.device = DEVICE,
file_id: str | None = None
) -> du.RunResults:
if config is None:
config = CONFIG
if file_id is None:
file_id = url
response = requests.get(url)
response.raise_for_status()
raw_audio_data = response.content
return du.process_file(
io.BytesIO(raw_audio_data),
model,
config,
device,
file_id
) There was a problem hiding this comment.
Can you please add tests that :
- processes a file loaded in a buffer using the modified
process_filefunction, and ideally compares the output when using just the path?
There was a problem hiding this comment.
I did. I also found a bug in get_samplerate. When the BytesIO object is read by librosa the cursor is moved, so when we call get_samplerate later we have to "seek(0)" to reset the cursor. I don't really like this, but I'm not sure I see a way around it.
…ed load_audio() implementation so that it uses load_audio_data but retains its signature. du.process_file() now does not need to call get_samplerate
|
@mbsantiago Hi again. I found a better way to solve the get_samplerate issue. I moved the implementation of And thanks a lot for taking your time to help! |
|
I fixed the syntax errors for Python3.9. |
|
Hi @kaviecos, sorry for the delay in the review. Had a couple of things cooking on the side, so had to take some time off, but I'm back. Thanks again for all the changes. They are all looking pretty good. Review in detail and merge soon after. |
|
No worries. |
|
@mbsantiago Any chance you'll have time to look at this? We'll soon(ish) be looking at running a lot of files through batdetect2 and it would be nice to be on the main branch :) |
mbsantiago
left a comment
There was a problem hiding this comment.
Hi @kaviecos,
Sorry for the big delay in getting back to this — I got sidetracked with a bunch of things that came up. I think the changes you’ve made look great, and based on my tests and review, I’m confident they won’t break any existing code. I’m happy to go ahead and merge them.
I’ve left a couple of very minor comments that I think could be helpful down the line. If you have a bit of time to take a look, I’d really appreciate it — let me know. Otherwise, we can go ahead and merge as is.
Huge thanks again for your work — I hope it’s useful for your use case, and I’m sure it’ll benefit others as well!
Cheers,
Santiago
batdetect2/api.py
Outdated
| def process_file( | ||
| audio_file: str, | ||
| path: Union[ | ||
| str, int, os.PathLike[Any], sf.SoundFile, audioread.AudioFile, BinaryIO |
There was a problem hiding this comment.
I believe BinaryIO covers the sf.SoundFile and audioread.AudioFile cases as they comply with the BinaryIO interface, so no need to include them :).
There was a problem hiding this comment.
I've copied the definition from librosa.load(). I think it makes sense to use the same definition, though I'm not sure how int can be useful. But I'll remove SoundFile and AudioFile if you prefer.
path: Union[
str, int, os.PathLike[Any], sf.SoundFile, audioread.AudioFile, BinaryIO
],
batdetect2/api.py
Outdated
| audio_file : str | ||
| Path to audio file. | ||
| path : Union[ | ||
| str, int, os.PathLike[Any], sf.SoundFile, audioread.AudioFile, BinaryIO |
There was a problem hiding this comment.
Would need updating. On a similar note, why is int expected here?
batdetect2/utils/audio_utils.py
Outdated
| def load_audio( | ||
| audio_file: str, | ||
| path: Union[ | ||
| str, int, os.PathLike[Any], sf.SoundFile, audioread.AudioFile, BinaryIO |
There was a problem hiding this comment.
Might be a good idea to make a Type alias for the Path type somewhere and reuse in all these functions for consistency and easy modification.
| # convert results to a dictionary in the right format | ||
| results = convert_results( | ||
| file_id=os.path.basename(audio_file), | ||
| file_id=_file_id, |
There was a problem hiding this comment.
Nice! I like the separation to a dedicated function
batdetect2/utils/detector_utils.py
Outdated
|
|
||
| return results | ||
|
|
||
| def _generate_id(path: Union[ |
batdetect2/utils/detector_utils.py
Outdated
| return results | ||
|
|
||
| def _generate_id(path: Union[ | ||
| str, int, os.PathLike[Any], sf.SoundFile, audioread.AudioFile, BinaryIO |
tests/test_api.py
Outdated
| assert len(results["pred_dict"]["annotation"]) == 0 | ||
|
|
||
| def test_process_file_file_id_defaults_to_basename(): | ||
| """Test that no detections are made above the nyquist frequency.""" |
There was a problem hiding this comment.
Please remove or update the docstrings
tests/test_api.py
Outdated
| assert id == basename | ||
|
|
||
| def test_bytesio_file_id_defaults_to_md5(): | ||
| """Test that no detections are made above the nyquist frequency.""" |
There was a problem hiding this comment.
Please remove or update the docstrings
|
@mbsantiago Sorry to bother you again, but I was just thinking if we should add an example for process_url to the README? (the URL in my example is temporary so we should provide another in the example, or just have Thanks a lot for taking your time to review and accept this feature :) |
|
Hi again @kaviecos! For sure, that's a great suggestion. Please feel free to make a new PR with changes. Would be nice to do it that way to keep track of your contributions, but if you don't have time I can add it later. Cheers! |
Suggestion for implementing support for more ways to load audio data.
This should be backwards compatible.
There is one additional change to the
api.process_fileanddu.process_filefrom the once we already discussed.I've had to add a "file_id" argument. It is optional and uses the basename approach as a fallback if file_id is not provided. The reason is that we cannot get the basename of a byte array.