A vocal pitch modulator that uses Machine Learning for realistic voice change. This is a project for NUS's CS4347 (Sound and Music Computing).
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
Make sure you have a suitable Python environment to run the project.
Install the necessary modules listed in requirements.txt, e.g. with `pip install -r requirements.txt`.
For an Anaconda environment:
`conda install --file requirements.txt`
The goal of vocal pitch modulation in this project is to maintain a “realistic” sound when the pitch is changed. With conventional modulation techniques, increasing the pitch of an audio file by more than around 3 semitones tends to make people sound like chipmunks, while lowering the pitch by more than around 3 semitones makes them sound demonic/dopey. However, there are people who naturally speak at lower and higher pitches without sounding this way, so the problem is not the pitch itself; rather, the spectral characteristics must be adjusted in a suitable manner to preserve realism. As such, in this project, we employ machine learning to adjust pitch without losing realism.
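For reference, the kind of conventional shift described above can be reproduced with off-the-shelf tools. The snippet below is a minimal sketch using librosa's standard `pitch_shift`; the file names and shift amounts are illustrative and this is not this repository's own code.

```python
# Minimal sketch of a conventional (naive) pitch shift using librosa.
# File names and shift amounts are illustrative; this is not the project's own code.
import librosa
import soundfile as sf

y, sr = librosa.load("voice_sample.wav", sr=None, mono=True)  # hypothetical input file

# Shifting by more than ~3 semitones with a plain phase-vocoder approach
# is what produces the "chipmunk" / "demonic" artefacts described above.
up = librosa.effects.pitch_shift(y, sr=sr, n_steps=5)     # +5 semitones: chipmunk-like
down = librosa.effects.pitch_shift(y, sr=sr, n_steps=-5)  # -5 semitones: dopey/demonic

sf.write("shift_up.wav", up, sr)
sf.write("shift_down.wav", down, sr)
```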
The relevant python files to refer to are as follows:
- ANN.py: Contains the PyTorch model code (`TimbreEncoder`, `TimbreVAE`, `BaseFFN`) and training functions
- DemoPitchShifter.py: Contains the wrapper functions used in the demo notebook
- PitchShifterModule.py: Contains the final pitch shift function that our project aimed to create. This function applies the naive pitch shift, followed by timbre conformation using an ANN that we have trained. The model data itself can be found here.
- Utils.py: Contains utility and graphing functions for convenience that are used throughout the notebooks.
- VPM.py: Contains the main functions, e.g. naive pitch shifting, conversion of audio to FFTs, FFTs to mels, etc.
Relevant Jupyter Notebooks are as follows:
- Data Processing for Training Walkthrough: This details our data processing workflow
- Demonstration Notebook App: This is where you can try out the actual pitch shifter.
- Pitch Up Recreation Attempts: This notebook details our experiments in finding a usable pitch shift method.
- SpecificModelCreation: This is the main ANN training notebook that was used to create all the models used in our final product.
- Timbre Encoder: This was the training notebook for the Timbre Encoder portion, which was used for Architecture 2.
As for directories:
- Archive: Simply an archive of old work
- Data: Inside this folder, you will find the list of files along with the relevant labels in dataset_files.csv. You will also find the Jupyter Notebooks that were used to generate the dataset, and the raw file list. We are not including the raw files in this repository, so these are provided for reference rather than for use.
- Data/dataset: The dataset we train our Artificial Neural Networks with can be found in this folder.
- Documentation: Figures detailing the implementation of the Vocal Pitch Modulator
- many_expts: The audio files produced by the different experiments can be found in this folder.
- model_data: Contains the model data for our final model.
- output_wav: The directory that the demo notebook will output pitch shifted sounds to
The following is the proposed modulation pipeline:
Image 1: Overall Vocal Pitch Modulation System Design
Please refer to the Vocal Pitch Modulation Audio and Waveform presentation page to listen to our reconstructed results for each method and experiment we tried.
As seen in Image 1, the pre-processing stage takes the input wav file, applies the STFT, and produces either the STFT, mel-spectrum, or MFCC for further processing. This stage also includes the data and pitch pairing for ANN training, which is used in stage 3.
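A rough sketch of this kind of pre-processing with standard librosa calls might look like the following; the parameter values (n_fft, hop_length, n_mels, n_mfcc) are illustrative and are not necessarily the project's actual settings.

```python
# Minimal sketch of the pre-processing described above, using standard librosa calls.
# Parameter values are illustrative, not the project's actual settings.
import numpy as np
import librosa

y, sr = librosa.load("voice_sample.wav", sr=None, mono=True)  # hypothetical input file

stft = librosa.stft(y, n_fft=1024, hop_length=256)                  # complex STFT
mel = librosa.feature.melspectrogram(S=np.abs(stft) ** 2, sr=sr,
                                     n_fft=1024, n_mels=128)        # mel-spectrum
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=20)  # MFCCs
```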
As seen in Image 1, the pitch shift stage takes in the mel-spectrum and, using a trained ANN, outputs a pitch-shifted mel-spectrum that is fed into the timbre training in stage 3.
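One simple way to realise a mapping between mel-spectrum frames is a frame-wise feed-forward network. The PyTorch sketch below is purely illustrative; the layer sizes, loss, and optimiser are assumptions and this is not the repository's `BaseFFN` (see ANN.py and the SpecificModelCreation notebook for the real models).

```python
# Rough PyTorch sketch of a frame-wise feed-forward mapping on mel frames.
# Layer sizes, loss, and optimiser are illustrative, not the repository's BaseFFN.
import torch
import torch.nn as nn

class MelFrameFFN(nn.Module):
    def __init__(self, n_mels=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, x):          # x: (batch, n_mels) input mel frame
        return self.net(x)         # predicted target mel frame

model = MelFrameFFN()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One hypothetical training step on paired (input, target) mel frames:
x = torch.randn(32, 128)           # e.g. input mel frames
y = torch.randn(32, 128)           # e.g. reference mel frames at the target pitch
loss = loss_fn(model(x), y)
optimiser.zero_grad()
loss.backward()
optimiser.step()
```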
Architecture 2 has an additional Timbre Encoder ANN, and the pipeline is as follows:
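As a rough, illustrative-only sketch of the timbre-encoder idea: an encoder of this kind compresses a mel frame into a compact timbre embedding that can be fed to the mapping network alongside the pitch-shifted spectrum. The code below is an assumption for illustration; the actual `TimbreEncoder` and `TimbreVAE` in ANN.py may differ substantially.

```python
# Illustrative-only sketch of a timbre-encoder-style network in PyTorch.
# The real TimbreEncoder / TimbreVAE in ANN.py may use a different architecture.
import torch
import torch.nn as nn

class SimpleTimbreEncoder(nn.Module):
    def __init__(self, n_mels=128, embed_dim=16):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(n_mels, 64), nn.ReLU(),
            nn.Linear(64, embed_dim),          # compact timbre embedding
        )

    def forward(self, mel_frame):              # (batch, n_mels)
        return self.encode(mel_frame)          # (batch, embed_dim)

encoder = SimpleTimbreEncoder()
mel_frame = torch.randn(8, 128)
timbre_embedding = encoder(mel_frame)
# The embedding could then be concatenated with a pitch-shifted mel frame
# before the timbre-conformation network:
combined = torch.cat([torch.randn(8, 128), timbre_embedding], dim=1)
```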
As seen in Image 1, the post-processing stage reconstructs the audio from the STFT to give our output result.
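A minimal librosa-based sketch of this kind of reconstruction, reusing `mel` and `sr` from the pre-processing sketch above, is shown below. It uses Griffin-Lim phase estimation; the parameters are illustrative and this is not the repository's own post-processing code.

```python
# Minimal sketch of reconstructing audio from a processed mel-spectrum with librosa.
# Griffin-Lim phase estimation is used; parameters are illustrative only.
import librosa
import soundfile as sf

audio = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("output_wav/reconstructed.wav", audio, sr)
```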
For a walkthrough of the typical data processing that we conducted, refer to the Data Processing for Training Walkthrough.
The following image is the vowel diagram we followed for dataset collection.
The following are additional aids which illustrate the organization of our data.
Thank you to Prof Wang Ye and WeiWei of CS4347 (Sound and Music Computing).
Big thanks to Vocal Pitch Modulation Team 13:
- Louiz-Kim
- Rachel Tan
- Zachary Feng
- Shaun Goh
The work of Yin-Jyun Luo and team greatly inspired us and helped us get started.