Nachtigall: dev log 1
Introduction
We're creating a little program that takes in a sound file, estimates the pitch and returns it in human-readable notation. See previously:
So far we have:
The problem
If the program gets a clean, machine-produced sample, all is good. If a voice sample gets fed in, though, the voiced_flag array is all False. Basically, the pyin algorithm struggles to recognise the voice as voiced frames. If we just take all frequencies instead, notes followed by a long silence before the next note are badly classified, as the "frequencies of the silence" contribute to the final value.
What can we do
We could focus on detecting note endings. One fun way to do that would be to reverse the input sample, then pass it through the onset detection algorithm and see if it detects the end of the note well enough. It's a bit ridiculous, I like it. Come to think of it, if there is onset detection in librosa, is there something for note ends as well? No, and judging by the onset detection documentation, it would detect the peaks in the envelope like before and bring us nowhere. So that's a no-go.
We could detect the silences, which should be minima in amplitude. If I understand correctly, if I always start the recordings with a bit of silence, the beginning of y gives me an estimate of the background noise. If I null every entry of y below that amplitude, maybe I can get somewhere. The question is, how long should that stretch be? I have the sample rate, so I can compute the number of samples in, say, half a second. Then take the mean amplitude of that silence and null everything below it. Sounds doable, let's try it out.
Attempt A, background noise suppression
Sample rate = number of samples / time, hence:
sample_range = 0.5 s × sample rate.
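As a minimal sketch of that arithmetic (the helper name `leading_silence` and the 22050 Hz rate are assumptions for illustration, not taken from the actual script):

```python
def leading_silence(y, sr, seconds=0.5):
    """Return the first `seconds` of the signal, assumed to be silence."""
    # seconds * (samples / second) = number of samples
    sample_range = int(seconds * sr)
    return y[:sample_range]


sr = 22050                 # assumed sample rate (librosa's default when loading)
y = [0.0] * (2 * sr)       # stand-in for a two-second recording
silence = leading_silence(y, sr)
print(len(silence))        # 11025
```

The slice works the same way on a NumPy array such as the `y` returned by librosa.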
Yields:
C2,C3,C1,C♯1,F1,C0
For a C2,C3,C1,C♯1,F1,E1 ground truth. Not there yet. But I'll keep it in, as it does not work horribly for the voiced samples. (Edit: bad idea, as we'll see.)
Attempt B, potentially better background noise suppression
If we look at the misclassified note, we have the following.
It starts solidly, then we get trailing nonsense. I could boost the noise suppression. Provided there is no click of a button, maybe it will erase that part to nothingness. I set the threshold to the max instead of the average. Nothing changes, which is weird. Does the noise suppression even work? No, inspecting the variables shows that y contains the same values before and after the noise suppression. Huh. In the clean, machine-produced .wav, the sound is indeed 0 for half a second. And for the voice? Same? The average is a negative value, because the samples are signed amplitudes that oscillate around zero... My simplistic attempt is not good enough, one would have to cut an entire amplitude band. Attempt A was misguided.
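For illustration, here is a rough sketch of what "cutting an amplitude band" could look like, working on magnitudes instead of signed values. It is not the code from the script: the function name `noise_gate` and the synthetic test signal are assumptions.

```python
import numpy as np


def noise_gate(y, sr, silence_seconds=0.5):
    """Zero every sample whose magnitude is within the measured noise floor."""
    n_silence = int(silence_seconds * sr)
    # Use the magnitude: the signed mean of audio is close to zero and
    # therefore useless as a threshold.
    floor = np.max(np.abs(y[:n_silence]))
    gated = y.copy()
    gated[np.abs(gated) <= floor] = 0.0
    return gated


# Synthetic stand-in: quiet "hiss", a 50 Hz tone, then hiss again.
sr = 1000
hiss = 0.01 * np.tile([1.0, -1.0], sr // 4)
tone = 0.5 * np.sin(2 * np.pi * 50 * np.arange(sr) / sr)
y = np.concatenate([hiss, tone, hiss])

gated = noise_gate(y, sr)
print(np.mean(hiss))          # signed mean: essentially zero
print(np.max(np.abs(hiss)))   # magnitude: the actual noise level
```

A per-sample gate like this also zeroes the tone wherever it crosses zero, so a more careful version would probably threshold a smoothed envelope rather than individual samples.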
Librosa must have tools for this. Ah, lucky for us, while digging through the docs I found librosa.effects.trim, made for removing silence ("Trim leading and trailing silence from an audio signal." says the doc). Let's try it out. It's as simple as:
It uses the peaks in the original signal as reference. Replacing y with the trimmed signal in the script is trivial, but it still leads to the same result. One reread later: we can pass a threshold in decibels, below which the signal counts as silence. Let's try it out. Even when I put 80 dB it does not appear to do a thing. I must be missing something. Let's put an absurdly high value. Ah, rereading the doc indicates that the threshold is in decibels below the reference, so it's relative... Still does not change any value.
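To make the "relative" part concrete, here is a small sketch of the arithmetic, assuming the usual amplitude convention of 20·log10(amplitude / reference). The helper `silence_threshold` is hypothetical and not part of librosa.

```python
def silence_threshold(reference, top_db):
    """Amplitude that sits `top_db` decibels below `reference`."""
    # 20 * log10(threshold / reference) = -top_db
    return reference * 10 ** (-top_db / 20)


# With a reference amplitude of 1.0:
for top_db in (20, 60, 80):
    print(top_db, silence_threshold(1.0, top_db))
```

Under this reading, a larger value moves the threshold further down, so less of the signal counts as silence. Trimming more aggressively would mean lowering the value, not raising it.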
Attempt C, verifying that the ground truth is alright
I checked the machine-generated .wav that I use for the tests above, and I noticed that the profile of the last note was visually distinct. Indeed, in retrospect, the sound is low enough to clip a bit. That is not really representative. So let's record a new sample in a better range:
C5,D5,D#6,A#5,A5,C6
Feeding it to the program yields:
5 notes detected:
C3,D3,D♯4,A♯3,A3
That's... not the same. At least the note names are right, but the octaves are off and an entire note is missing. Making the fmin/fmax of pyin more generous doesn't change things. Alright, is it an issue with the m8, which I am using to generate the whole thing? Trying it out with a voiced sample, the onset detection is wonky and the noise suppression does not appear to be super effective.
Alright, let's use another source: alda. The score is: A4 B#4 C4 D4 F#4, with rests between each note.
What does it yield?
A4,C5,C4,D4
Disappointing. The amplitude of the sample is a bit weak, though. If I boost it, is it better?
Same exact result.
Reassessing the situation
The program is wonky. Both the note detection and classification have many errors.
When I listen to the onset detections with clicks, the starts are well placed, but the tail is often weird. N-1 notes are detected too often, meaning there is probably an issue with the loop that does the note segmenting. The detected frequencies at the beginning of the notes seem accurate enough. So it's really about cutting those notes better, it appears. Even if I damp it down to 0, it will get included.
First, let's fix that N-1 note detection. Looking at the code, the problem jumps out at me:
I assumed the note end would have an onset too... In reality, there are as many notes as onsets.
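A sketch of the corrected segmentation, with one segment per onset and the last note running to the end of the signal. The function name and arguments are hypothetical, since the actual loop is not reproduced here.

```python
def note_segments(onsets, end):
    """Pair each onset with the next one; the last note runs until `end`."""
    bounds = list(onsets) + [end]
    return list(zip(bounds[:-1], bounds[1:]))


# Three onsets (as sample or frame indices) give three notes, not two.
segments = note_segments([0, 100, 250], end=400)
print(segments)       # [(0, 100), (100, 250), (250, 400)]
print(len(segments))  # 3
```

Pairing only consecutive onsets, without appending the end of the signal, is what would produce N-1 segments.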
Now we get as many classifications as we should. They are still wrong of course. But we will fix that next.
Attempt D, throwing cleverness out of the window
I tried several other ways to detect silences, but with unreliable results so far. So let's do something much simpler: we'll only consider the first third of the sequence between onsets.
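In code, the idea could look roughly like this. The helper `first_third` is hypothetical, and the guard for very short ranges is my own assumption about what a robust version would need.

```python
def first_third(start, end):
    """Return the bounds of the first third of the range [start, end)."""
    length = end - start
    # Keep at least one sample or frame so the slice is never empty.
    return start, start + max(1, length // 3)


print(first_third(0, 300))    # (0, 100)
print(first_third(100, 250))  # (100, 150)
print(first_third(10, 11))    # (10, 11)
```

Only the frequencies inside these bounds would then be used to classify the note.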
Will it cover all use cases? No. Will it cover my specific use case, where notes are played separately, with silences in between, and are all of roughly the same length?
And that's a nope.
Some onset detections apparently yield time ranges that are too short.
Summary
Sadly, not much progress today. That's life sometimes. The next thing I want to try is to change the data representation, to see if the problem is easier to solve using a spectrogram. We'll see.
References
librosa onset detection documentation
Licence
Given that I have technically made a derivative of existing software by remixing a librosa example file, here is the applicable license. Many thanks to Brian McFee as well as all the other contributors for making my life easier, I appreciate it.