kwertyops

[https://reddit.com/r/MLQuestions/comments/kdmeqq/project_acoustic_event_detection/gfyu8vp/]

Your model will only be able to properly operate on input of the type it was trained on. You can’t throw full recordings into a model trained on short clips and expect it to perform the same.

If you want to be able to dump a full-length recording in and have all possible sound sources labelled, then you need to train the model using full-length recordings as input, with all the sounds labelled. Something like that would likely be more difficult to train than your simpler model, but that’s a separate issue.

If you want to use your model that was trained on short single-sound audio clips, then the input needs to be short (potentially single-sound) audio clips. Maybe you can do pre-processing on the full recording to make the input fit your current model. For example, segment the full recording into overlapping 2-3 second chunks, predict the contents of each chunk, and then aggregate those predictions into a composite prediction for the full recording. Or you could use some other technique to identify sound onsets and do the segmentation more intelligently.
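
A minimal sketch of that chunk-and-aggregate idea, assuming you have a trained `model` object whose `predict` method returns per-class probabilities for one fixed-length clip (the names `model`, `clip_len_s`, `hop_s`, and `threshold` here are illustrative, not from your project):

```python
import numpy as np
import librosa

def predict_full_recording(path, model, class_names,
                           clip_len_s=2.5, hop_s=1.0, sr=22050,
                           threshold=0.5):
    """Slide a fixed-length window over a long recording, run the
    short-clip model on each chunk, and aggregate per-class scores."""
    y, sr = librosa.load(path, sr=sr)
    clip_len = int(clip_len_s * sr)
    hop = int(hop_s * sr)

    scores = []
    for start in range(0, max(1, len(y) - clip_len + 1), hop):
        chunk = y[start:start + clip_len]
        # Zero-pad the final chunk so every input matches the
        # clip length the model was trained on.
        if len(chunk) < clip_len:
            chunk = np.pad(chunk, (0, clip_len - len(chunk)))
        scores.append(model.predict(chunk))  # shape: (n_classes,)

    scores = np.stack(scores)  # (n_chunks, n_classes)
    # One simple aggregation rule: call a class "present" if any
    # single chunk is confident enough about it.
    max_per_class = scores.max(axis=0)
    return [name for name, s in zip(class_names, max_per_class)
            if s >= threshold]
```

For the smarter-segmentation variant, something like `librosa.onset.onset_detect` could supply the chunk start points instead of a fixed hop, so that each window begins at a detected sound onset rather than at an arbitrary offset.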

I think a pre- and post-processing pipeline in your case would be easier than trying to train a single model to do it all at once, but that’s up to you.