Voice conversion is a difficult task: it requires changing the acoustic features of speech while keeping the result natural, intelligible, and convincing in speaker identity. Data scarcity is a major challenge, particularly for parallel, one-shot, and cross-lingual methods, although data augmentation, transfer learning, and unsupervised learning can help. Selecting the right features also matters: they should capture relevant information such as phonetic, prosodic, and speaker characteristics while discarding irrelevant or noisy information. Spectral, cepstral, and vocoder-based features are common choices. Evaluating the quality and similarity of converted voices is challenging as well, because it relies on both subjective and objective criteria. In subjective evaluation, human listeners rate the output on various aspects; in objective evaluation, mathematical metrics measure the distance or correlation between output and target features. The two do not always agree, and each evaluation method has its own strengths and weaknesses.
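To make the idea of objective evaluation concrete, here is a minimal sketch of one distance metric and one correlation metric. The distance metric is mel-cepstral distortion (MCD), which is widely used for cepstral features; the correlation metric is a Pearson correlation, which could be applied to a prosodic contour such as F0. This is an illustration and not a prescribed evaluation protocol: the function names are hypothetical, and the code assumes the converted and target frames have already been time-aligned (for example with dynamic time warping) and that coefficient 0 carries energy and is excluded.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    converted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> float:
    """Mean MCD in dB over frames (hypothetical helper).

    Assumes both inputs are already time-aligned, frame by frame,
    and that index 0 of each frame is the energy term, which is skipped.
    """
    if len(converted) != len(target) or not converted:
        raise ValueError("expected two non-empty, equally long frame sequences")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for conv_frame, tgt_frame in zip(converted, target):
        squared_diff = sum(
            (c - t) ** 2 for c, t in zip(conv_frame[1:], tgt_frame[1:])
        )
        total += scale * math.sqrt(squared_diff)
    return total / len(converted)


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long contours, e.g. F0 tracks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("expected two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A lower MCD means the converted cepstra are closer to the target, and a correlation near 1 means the two contours rise and fall together. Neither number says directly how natural the result sounds to a listener, which is one reason objective and subjective scores can diverge.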