1. Introduction

With its origins in Information Retrieval research, a fundamental goal of Music Information Retrieval (MIR) when it was established as a dedicated research field in the year 2000 was to develop technology to assist the user in finding music, information about music, or information in music (). Since then, also driven by developments in content-based analysis, semantic annotation, and personalization, intelligent music applications have had a significant impact on people’s interaction with music. These applications comprise “active music-listening interfaces” () which augment the process of music listening to increase user engagement and/or give deeper insights into musical aspects, even for musically less experienced listeners. For accessing digital repositories of acoustic content, retrieving relevant pieces from music collections, and discovering new music, interfaces based on querying, visual browsing, or recommendation have facilitated new modes of interaction.

Revisiting, 16 years later, an early definition of MIR by Downie () as “a multidisciplinary research endeavor that strives to develop innovative content-based searching schemes, novel interfaces, and evolving networked delivery mechanisms in an effort to make the world’s vast store of music accessible to all” reveals that these developments were in fact intended and embedded in MIR from the beginning. Given the music industry landscape and how people listen to music today, this visionary definition has undoubtedly stood the test of time.

In this paper, we reflect on the evolution of MIR-driven user interfaces for music browsing and discovery over the past two decades—from organizing personal music collections to streaming a personalized selection from “the world’s vast store of music”. To this end, we connect major developments that have transformed and shaped MIR research in general, and user interfaces in particular, to the prevalent and emerging listening practices of their time. We identify three main phases, each of which has laid the foundation for the next, and review work that focuses on the specific aspects of these phases.

First, in Section 2, we investigate the phase of growing digital personal music collections and the interfaces built upon intelligent audio processing and content description algorithms. These algorithms facilitate the automatic organization of repositories and the retrieval of music from personal collections, as well as commercial repositories, according to sound qualities. Second, in Section 3, we investigate the emergence of collective web platforms and their exploitation for listening interfaces. The extracted user-generated metadata often pertains to semantic descriptions and complements the content-based methods that drove the developments of the preceding phase. This phase also constitutes an intermediate step towards the exploitation of collective listening data, which is the driving force behind the third and still ongoing phase, connected to streaming services (Section 4). Here, the large-scale collection of online music interaction traces and their exploitation in recommender systems are the defining elements. Extrapolating these and other ongoing developments, we outline possible scenarios of music recommendation and listening interfaces of the future in Section 5.

Note that the phases we identify in the evolution of user interfaces for music discovery also reflect the “three ages of MIR” as described by Herrera (). Herrera refers to these three phases as “the age of feature extractors”, “the age of semantic descriptors”, and “the age of context-aware systems”, respectively. We further concur that an “age of creative systems” is already underway, building upon MIR to facilitate new interfaces that support creativity, as we discuss in Section 5. We believe that this strong alignment gives further evidence of the pivotal role of intelligent user interfaces in the development of MIR. While user interfaces, especially in the early phases, were often mere research prototypes, their development is tightly intertwined with ongoing trends. Thus, they provide essential gauges of the state of the art and, beyond that, give a perspective on what could be possible.

2. Phase 1: Content-Based Music Retrieval Interfaces

The late 1990s see two pivotal developments. On the one hand, the Internet becomes established as a mainstream communication medium and distribution channel. On the other hand, technological advances in the encoding and compression of audio signals (most notably mp3) allow for the distribution of hi-fi audio content via the Internet and lead to the development of high-capacity portable music players (; ). This impacts not only the music industry, but also initiates a profound change in the way people “use” music ().

At the time, the most popular and conventional interfaces for such music access display lists of bibliographic information (metadata) such as titles and artist names. When the number of musical pieces in a personal music collection is small, interfaces with title lists and simple text searches over bibliographic information suffice for browsing the whole collection and choosing pieces to listen to. However, as the accessible collection grows and becomes largely unfamiliar, such simple interfaces become insufficient (; ), and new research approaches targeting the retrieval, classification, and organization of music emerge.

“Intelligent” interfaces for music retrieval become a research field of interest with the developments in content-based music retrieval (). A landmark in this regard is the development of query-by-humming systems () and search engines indexing sound properties such as loudness, pitch, and timbre (), which initiate the emancipation of music search systems from traditional text- and metadata-based indexing and query interfaces. While interfaces are still very much targeted at presenting results in sequential order according to their relevance to a query, starting in the early 2000s, MIR research proposes several alternatives to facilitate music discovery.

2.1 Map-based music browsing and discovery

Interfaces that allow content-based searches for music retrieval are useful when people can formulate good queries, especially when they are looking for a particular work, but it is often difficult to come up with an appropriate query when faced with a huge music collection and vague search criteria. Interfaces for music browsing and discovery are therefore proposed to let users encounter unexpected but interesting musical pieces or artists. Visualization of a music collection is one way to provide users with various bird’s-eye views and rich interaction. The most popular visualization is to project musical pieces or artists onto a 2D or 3D space (“map”) by using music similarity. 2D visualizations also lend themselves to being applied on tabletop interfaces for intuitive access and interaction (e.g. ). The trend of spatially arranging collections for exploration can be seen throughout the past 20 years and remains unbroken, cf. Figure 1.

Figure 1 

Examples of map-based music browsing interfaces based upon dimensionality reduction techniques.

One of the earliest interfaces is GenreSpace by Tzanetakis et al. (), which visualizes musical pieces with genre-specific colors in a 3D space (see Figure 1(a) for a greyscale image). The coloring of each piece is determined by automatic genre classification. The layout of pieces is determined by principal component analysis (PCA), which projects high-dimensional audio feature vectors onto 3D positions.
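As a rough illustration of this kind of layout (a sketch under the assumption of precomputed audio features and genre predictions, not GenreSpace’s actual implementation), the projection itself reduces to a standard PCA call:

```python
# Minimal sketch: project high-dimensional audio feature vectors to 3D with
# PCA and assign a genre-specific color index to each piece. Feature
# extraction and genre classification are assumed to happen elsewhere.
import numpy as np
from sklearn.decomposition import PCA

def layout_collection(features: np.ndarray, genres: list):
    """features: (n_tracks, n_dims) audio descriptors; genres: one label per track."""
    coords = PCA(n_components=3).fit_transform(features)    # 3D position per track
    palette = {g: i for i, g in enumerate(sorted(set(genres)))}
    colors = np.array([palette[g] for g in genres])          # genre-specific color index
    return coords, colors
```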

Another early interface by Pampalk et al. () called Islands of Music visualizes musical pieces on a 2D space representing an artificial landscape, cf. Figure 1(b). It uses a self-organizing map (SOM) to arrange musical pieces so that similar pieces are located near each other, and uses a metaphor of “islands” that represent self-organized clusters of similar pieces. The denser the regions (more items in the same cluster), the higher the landscape (up to “mountains” for very dense regions). Sparse regions are represented by the ocean. Several extensions of the Islands of Music idea were proposed in the following years. An aligned SOM is used by Pampalk et al. () to enable a shift of focus between clusterings created for different musical aspects. This interface provides three different views corresponding to similarities based on three aspects: (1) timbre analysis, (2) rhythm analysis, and (3) metadata like artist and genre. A user can smoothly change focus from one view to another while exploring how the organization changes. Neumayer et al. () propose a method to automatically generate playlists by drawing a curve on the SOM visualization.
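A minimal sketch of the underlying SOM arrangement, assuming precomputed feature vectors and using the third-party MiniSom package (our choice for illustration, not the original implementation), might look as follows; the per-cell track counts correspond to the “height” of the landscape:

```python
# Sketch: train a self-organizing map on audio features and count how many
# tracks land on each map cell; dense cells become "islands"/"mountains",
# empty cells remain "ocean".
import numpy as np
from minisom import MiniSom

def island_map(features: np.ndarray, grid=(20, 20), iterations=5000):
    som = MiniSom(grid[0], grid[1], features.shape[1], sigma=1.5, learning_rate=0.5)
    som.random_weights_init(features)
    som.train_random(features, iterations)
    cells = [som.winner(f) for f in features]   # map cell of each track
    density = np.zeros(grid)                    # landscape "height"
    for i, j in cells:
        density[i, j] += 1
    return cells, density                       # similar tracks end up on nearby cells
```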

The nepTune interface presented by Knees et al. (), as shown in Figure 1(c), enables exploration of music collections by navigating through a three-dimensional artificial landscape. Variants include a mobile version () and a larger-scale version using a growing hierarchical self-organizing map () that automatically structures the map into hierarchically linked individual SOMs (). Lübbers and Jarke () present a browser employing multi-dimensional scaling (MDS) and SOMs to create three-dimensional landscapes. In contrast to the Islands of Music metaphor, they use an inverse height map, meaning that agglomerations of songs are visualized as valleys, while clusters are separated by mountains. Their interface further enables the user to adapt the landscape by building or removing mountains, which triggers an adaptation of the underlying similarity measure.

Another SOM-based browsing interface is Globe of Music by Leitich and Topf (), which maps songs to a sphere instead of a plane by means of a GeoSOM (). Mörchen et al. () employ an emergent SOM and the U-map visualization technique () to color-code similarities between neighboring clusters. Vembu and Baumann () incorporate a dictionary of musically related terms to describe similar artists.

While the above interfaces focus on musical pieces, interfaces focusing on artists have also been investigated. For example, Artist Map by van Gulik and Vignoli () is an interface that enables users to explore and discover artists. This interface projects artists onto a 2D space and visualizes them as small dots with genre-specific, tempo-specific, or year-specific colors, cf. Figure 1(d). This visualization can also be used to create playlists by drawing paths and specifying regions.

In the Search Inside the Music application, Lamere and Eck () use a three-dimensional MDS projection, cf. Figure 1(e). Their interface provides different views that arrange images of album covers according to the output of the MDS, either in a cloud, a grid, or a spiral.

Other examples use, e.g., metaphors of a “galaxy” or “cosmos,” or extend visualizations with additional information. MusicGalaxy by Stober and Nürnberger (), for example, is an exploration interface that uses a similarity-preserving projection of musical pieces onto a 2D galaxy space. It takes timbre, rhythm, dynamics, and lyrics into account when computing similarity and uses an adaptive non-linear multi-focus zoom lens that can simultaneously magnify multiple regions of interest, whereas most interfaces support zooming only a single region, cf. Figure 1(f). The related metaphor of a “planetarium” has been used in Songrium by Hamasaki et al. (). Songrium is a public web service for interactive visualization and exploration of web-native music on video sharing services. It uses similarity-preserving projections of pieces onto both 2D and 3D galaxy spaces and provides various functions: analysis and visualization of derivative works, and interactive chronological visualization and playback of musical pieces, cf. Figure 1(g).

Vad et al. () apply t-SNE () to mood- and emotion-related descriptors, which they infer from low-level acoustic features. The result of the data projection is visualized on a 2D map, around which the authors build an interface to support the creation of playlists by drawing a path and by area selection, as can be seen in Figure 1(h).
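One plausible way to realize such path-based playlist creation (our assumption of the mechanism, not the authors’ exact method) is to project the descriptors with t-SNE and, for each point sampled along the drawn path, pick the nearest not-yet-used track:

```python
# Sketch: 2D t-SNE projection of mood/emotion descriptors plus a simple
# path-to-playlist mapping.
import numpy as np
from sklearn.manifold import TSNE

def map_and_playlist(features: np.ndarray, path_points: np.ndarray):
    """features: (n_tracks, n_dims); path_points: (n_points, 2) sampled from the drawn path."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(features)
    playlist, used = [], set()
    for p in path_points:
        order = np.argsort(np.linalg.norm(coords - p, axis=1))  # tracks nearest to this point
        nxt = next(int(i) for i in order if int(i) not in used)
        playlist.append(nxt)
        used.add(nxt)
    return coords, playlist
```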

MoodPlay by Andjelkovic et al. () uses correspondence analysis on categorical mood metadata to visualize artists in a latent mood space, cf. Figure 1(i). More details on the interactive recommendation approach facilitated through this visualization can be found in Section 4.1.

Instrudive by Takahashi et al. () enables users to browse and listen to musical pieces by focusing on automatically detected instrumentation. It visualizes each musical piece as a multicolored pie chart in which different colors denote different instruments, cf. Figure 1(j). The ratios of the colors indicate the relative durations for which the corresponding instruments appear in the piece.

2.2 Content-based filtering and sequential play

When a music collection becomes huge, it is not feasible to visualize all pieces in the collection. Other types of interfaces that visualize only a part of the collection instead of the whole have therefore also been proposed. An example is Musicream by Goto and Goto (), a user interface that focuses on inducing active user interactions to discover and manage music in a huge collection. The idea behind Musicream is to see whether people can break free from the stereotyped thinking that music playback interfaces must be based on lists of song titles and artist names. To satisfy the desire “I want to hear something,” it allows a user to unexpectedly come across various pieces similar to ones that the user likes. As shown in Figure 2(a), disk icons representing pieces flow one after another from top to bottom, and a user can select a disk and listen to it. By dragging a favorite disk in the flow, which then serves as a query, the user can easily pick out other pieces similar to the query disk (similar disks attach to it) based on content-based similarity. In addition, to satisfy a desire like “I want to hear something my way,” Musicream gives a user greater freedom in editing playlists, e.g., by generating a playlist of playlists. Since all operations are automatically recorded, the user can also revisit and retrieve any past state as if using a time machine.

Figure 2 

Interfaces for sequential exploration of collections based on content similarity.

The FM4 Soundpark Player by Gasser and Flexer () makes content-based suggestions by showing up to five similar tracks in a graph-like manner, cf. Figure 2(b), and constructing “mixtapes” from given start and end tracks (). VocalFinder by Fujihara et al. () enables content-based retrieval of songs with vocals that have similar vocal timbre to the query song.

Visualization of a music collection is not always necessary to develop music interfaces. Stewart et al. () present an interface that uses only auralization and haptic feedback to explore a large music collection in a two- or three-dimensional space.

The article “Reinventing the Wheel” by Pohle et al. () demonstrates that a single-dial browsing device can be a useful interface for musical pieces stored on mobile music players. The whole collection is ordered into a circular, locally consistent playlist by using a Traveling Salesman algorithm so that similar pieces are arranged adjacently. The user may simply turn the wheel to access different pieces. The interface also has the advantage of combining two different similarity measures, one based on timbre analysis and the other based on community metadata analysis. Figure 2(c) shows an extended implementation of this concept by Schnitzer et al. () on an Apple iPod, the most popular mobile listening device at the time.
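The core idea of the circular, locally consistent ordering can be sketched with a simple nearest-neighbour heuristic over a precomputed track-to-track distance matrix (a rough illustration; the original work uses a more elaborate TSP solver and combined similarity measures):

```python
# Sketch: order a collection into one closed playlist so that neighbouring
# positions contain similar tracks; turning the wheel then steps through it.
import numpy as np

def circular_playlist(dist: np.ndarray, start: int = 0) -> list:
    """dist: symmetric (n, n) distance matrix between tracks."""
    n = dist.shape[0]
    unvisited = set(range(n)) - {start}
    order = [start]
    while unvisited:
        last = order[-1]
        nxt = min(unvisited, key=lambda j: dist[last, j])  # greedily append nearest track
        order.append(nxt)
        unvisited.remove(nxt)
    return order  # treated as circular: the last track wraps around to the first
```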

2.3 Summary of Phase 1

Phase 1 is strongly connected to browsing interfaces that make use of features extracted from the signal and present repositories in a structured manner to make them accessible. As many of these developments are rooted in the early years of MIR research, they often reflect the technological state of the art in terms of content descriptors, with the discovery interface attached as a communication vehicle to present the capabilities of the underlying algorithms. Thus, user experience (UX) beyond offering a novel, alternative view on collections or assisting in the task of creating playlists is not the focus of these discovery interfaces. Consequently, user-centric evaluations of the interfaces are scarce and often only anecdotal.

Later works put more emphasis on the evaluation of the proposed interfaces. Findings include that while users initially expect to find genre-like structures on maps, other organization criteria like mood are perceived positively for exploration, rediscovery, and playlist generation once users become familiar with them (; ).

3. Phase 2: Collaborative and Automatic Semantic Description

While content-based analysis allowed for unprecedented views on music collections based on sound, interfaces built solely upon the extracted information were not able to “explain” the music contained or give semantically meaningful support for orientation within the collections. That is, while they are able to capture qualities of the sound of the contained music, they largely neglect existing concepts of music organization, such as (sub-)genres, and how people use music, e.g., according to mood or activity (; ). Such cultural information is, however, typically found on the web and ranges from user-generated tags to unstructured bits of expressed opinions (e.g., forum posts or comments in social media) to more detailed reviews and encyclopedic articles (containing, e.g., biographies and discography release histories). In MIR, this type of data is often referred to as community metadata or music context data ().

These online “collaborative efforts” of describing music result in a rich vocabulary of semantic labels (“folksonomy”) and have shaped music retrieval interfaces towards music information systems starting around 2005. A very influential service at this time, both as a music information system and as a source of semantic social tags, is Last.fm. In parallel, platforms like Audioscrobbler, which merged with Last.fm in 2005, take advantage of users being increasingly always connected to the Internet by tracking listening events in order to identify listening patterns and make recommendations, leading to the phase of automatic playlisting and music recommendation (cf. Section 4). In this section, we focus on semantic labels, such as social tags (), describing musical attributes as well as metadata and descriptors of musical reception, as a main driver of MIR research and music interfaces.

3.1 Collaborative platforms and music information systems

With music-related information being ubiquitous on the web, dedicated web platforms that provide background knowledge on artists emerge, e.g. the AllMusic Guide, relying on editorial content. Using new technologies, such music information systems can, however, also be built by aggregating information extracted from various sources, such as knowledge bases (; ) or web pages (), or by taking advantage of the “wisdom of the crowd” () and building collaborative platforms like the above-mentioned Last.fm.

A central feature of Last.fm is to allow users to tag their music, ideally resulting in a democratic ground truth () of what could be considered the semantic dimensions of the corresponding tracks, cf. Figure 3(a). However, typical problems arising with this type of information are noise and non-trustworthy annotations, as well as data sparsity and cold-start issues, mostly due to popularity biases (cf. ).

Figure 3 

Exemplary sources for human-generated semantic annotations. (a) collaborative tags; (b) and (c) games with a purpose or crowdsourcing.

MIR research during this phase therefore deals extensively with auto-tagging, i.e., automatically inferring semantic labels from the audio signal of a music piece (or related data), to overcome this shortcoming (e.g. ; ; ; ; ; ; ).
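Framed abstractly, auto-tagging is a multi-label classification problem over audio-derived features. The following sketch illustrates only that framing; the cited systems used a variety of other models (e.g. generative or boosting approaches), so the classifier choice here is purely an assumption for illustration:

```python
# Sketch: auto-tagging as multi-label classification on summary audio features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

def train_autotagger(features: np.ndarray, tag_matrix: np.ndarray):
    """features: (n_tracks, n_dims); tag_matrix: (n_tracks, n_tags) binary tag indicators."""
    return MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(features, tag_matrix)

def predict_tags(model, features: np.ndarray, tag_names: list, threshold: float = 0.5):
    # probability of each tag being present, per track
    probs = np.stack([est.predict_proba(features)[:, 1] for est in model.estimators_], axis=1)
    return [[t for t, p in zip(tag_names, row) if p >= threshold] for row in probs]
```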

Alternative approaches to generate semantic labels involve human contributions. TagATune by Law et al. () is a game that pairs players across the Internet who try to determine whether they are listening to the same song by typing tags, cf. Figure 3(b). In return for entertaining users, TagATune has collected interesting tags for a database of songs. Other examples of interfaces that were designed to collect useful information while engaging with music are MajorMiner by Mandel and Ellis () (see Figure 3(c)), Listen Game by Turnbull et al. (), HerdIt by Barrington et al. (), and Moodswings by Kim et al. () (cf. Section 4.1).

A more traditional way to obtain musically informed labels is to have human experts, e.g. trained musicians, manually label music tracks according to predefined musical categories. This approach is followed by the Music Genome Project, and serves as the foundation of Pandora’s automatic radio stations (cf. Section 4). In the Music Genome Project, according to Prockup et al. (), “the musical attributes refer to specific musical components comprising elements of the vocals, instrumentation, sonority, and rhythm.”

As a consequence of these efforts, during this phase, the question of how to present and integrate this information in interfaces was secondary to the question of how to obtain it, as will become apparent in the following.

3.2 Visual interfaces

With the trend towards web-based interfaces, visualizations and map-based interfaces integrating semantic information have been proposed. This semantic information comprises tags typically referring to genres and musical dimensions such as instrumentation, as well as geographical data and topics reflecting the lyrical content.

MusicRainbow by Pampalk and Goto () is a user interface for discovering unknown artists, which follows the above idea of a single-dial browsing device but features an informative visualization. As shown in Figure 4, artists are mapped onto a circular rainbow where colors represent different styles of music. Similar artists are automatically mapped near each other by using the traveling salesman algorithm and are summarized with word labels extracted from artist-related web pages. A user can rotate the rainbow by turning a knob and find an interesting artist by referring to the word labels. The nepTune interface shown in Figure 1(c) also provides a mode that integrates text-based information extracted from artist web pages to support navigation in the 3D environment. To this end, labels referring to genres, instruments, origins, and eras serve as landmarks.

Figure 4 

MusicRainbow: An artist discovery interface that enables a user to actively browse a music collection by using audio-based similarity and web-based labeling.

Other approaches explore music context data to visualize music over real geographical maps, rather than computing a clustering based on audio descriptors. For instance, Govaerts and Duval () extract geographical information from biographies and integrate it into a visualization of radio station playlists, cf. Figure 5. Hauger and Schedl () extract listening events and location information from microblogs and visualize both on a world map.

Figure 5 

Automatically enriched information system by Govaerts and Duval ().

Lyrics are also important elements of music. By using semantic topics automatically estimated from lyrics, new types of visual interfaces for lyrics retrieval can be achieved. LyricsRadar by Sasaki et al. () is a lyrics retrieval interface that uses latent Dirichlet allocation (LDA) to analyze topics of lyrics and visualizes the topic ratio for each song by using the topic radar chart. It then enables a user to find her favorite lyrics interactively. Lyric Jumper by Tsukuda et al. () is a lyrics-based music exploratory web service that enables a user to choose an artist based on topics of lyrics and find unfamiliar artists who have a similar profile to her favorite artist. It uses an advanced topic model that incorporates an artist’s profile of lyrics topics and provides various functions such as topic tendency visualization, artist ranking, artist recommendation, and lyric phrase recommendation.
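At its core, the topic analysis driving such lyrics interfaces can be sketched as fitting LDA to a bag-of-words representation of the lyrics and reading off each song’s topic distribution (the vectorizer settings and topic count below are illustrative assumptions, not those of the cited systems):

```python
# Sketch: estimate per-song topic ratios from lyrics; these ratios could then
# feed a radar-chart style visualization or topic-based retrieval.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lyric_topic_ratios(lyrics: list, n_topics: int = 5):
    """lyrics: list of lyric strings, one per song."""
    counts = CountVectorizer(stop_words="english", min_df=2).fit_transform(lyrics)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)   # (n_songs, n_topics), each row sums to ~1
```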

3.3 Summary of Phase 2

The second phase of music discovery interfaces emphasizes textual representations to convey semantic features of music tracks to the user. This gives the user deeper insights into individual tracks and allows for exploration through specific facets, rather than structuring repositories and identifying neighboring tracks based on a similarity function that integrates various aspects. On the user’s side, these interfaces require more active exploration and selection of relevant properties when browsing.

With the integration of semantic information from structured, semi-structured, and unstructured sources, traditional retrieval paradigms again become more relevant in the music discovery process (cf. ). At the same time, the extracted music information, as well as the data collected during interaction with collaborative platforms, can be exploited to facilitate passive discovery, leading to Phase 3.

4. Phase 3: Recommender Interfaces and Continuous Streaming

With ubiquitous Internet connectivity and computer and entertainment systems that are always online, physical music collections have lost relevance for many people, as virtually all music content is available at all times. In essence, subscription streaming services like Spotify, Pandora, Deezer, Amazon Music, and Apple Music have transformed the music business and music listening alike.

A central element of these services is personalization, i.e., providing foremost a user-tailored view onto the available collections of reportedly tens of millions of songs. Discovery of music is therefore also performed by the system, based on the user profile of past interactions, rather than just by the user herself.

Music recommendation typically models the personal preferences of users by using their listening histories or explicit user feedback (e.g. ; ). It then generates a set of recommended musical pieces or artists for each user. Such recommendation can be implemented using collaborative filtering based on users’ past behaviors, which exhibits patterns of music similarity not captured by content-based approaches (). When the playback order of recommended pieces is important, automatic playlist generation is also used (e.g. ; ; ).
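To make the collaborative filtering idea concrete, the following sketch factorizes a user-item play-count matrix with plain SGD and ranks unheard tracks by the reconstructed score. It is a minimal illustration of the principle under simplifying assumptions, not any particular service’s pipeline:

```python
# Sketch: implicit-feedback matrix factorization for music recommendation.
import numpy as np

def factorize(plays: np.ndarray, k: int = 32, epochs: int = 50, lr: float = 0.01, reg: float = 0.05):
    """plays: (n_users, n_items) matrix of (e.g. log-scaled) play counts."""
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(plays.shape[0], k))
    V = rng.normal(scale=0.1, size=(plays.shape[1], k))
    users, items = plays.nonzero()                  # train only on observed interactions
    for _ in range(epochs):
        for u, i in zip(users, items):
            pu, qi = U[u].copy(), V[i].copy()
            err = plays[u, i] - pu @ qi
            U[u] += lr * (err * qi - reg * pu)
            V[i] += lr * (err * pu - reg * qi)
    return U, V

def recommend(U, V, plays, user: int, n: int = 10):
    scores = U[user] @ V.T
    scores[plays[user] > 0] = -np.inf               # hide tracks the user already played
    return np.argsort(scores)[::-1][:n]
```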

The main challenge of this type of algorithm is, as in all other domains of recommender systems, the cold-start problem. The approach taken to remedy it is again to integrate additional information about the music items to be recommended, i.e. facets of content and metadata as applied in the earlier phases, by building hybrid recommenders on top of pure collaborative filtering. Additionally, context-awareness plays an important role, for instance to recommend music for daily activities ().

This still ongoing phase starts around 2007 and sees further boosts around 2010 and 2015, with an unbroken upward trend. An overview of the aspects, techniques, and challenges of music recommender systems is given by Schedl et al. (). Therefore, in this section, we do not elaborate on the basics of music recommender systems. Instead, we again highlight interfaces that focus on personalization and user-centric aspects (Section 4.1) and the recent trend of introducing psychologically-inspired user models in recommender algorithms (Section 4.2), as we consider these to be the bridge to future intelligent music listening interfaces.

4.1 Recommender interfaces

Although most related studies have focused on methods and algorithms for music recommendation and playlist generation, or on the user experience of recommender systems, some studies focus explicitly on interfaces.

MusicSun by Pampalk and Goto () is a user interface for artist recommendation. A user first puts favorite artist names into a “sun” metaphor, a circle in the center of the screen, and then obtains a ranked list of recommended artists. The sun is visualized with some surrounding “rays” that are labeled with words to summarize the query artists in the sun. By interactively selecting a ray, the user can look at and listen to the corresponding recommended artists.

MoodPlay by Andjelkovic et al. () is an interactive music recommender system that uses a hybrid recommendation algorithm based on mood metadata and audio content, cf. Section 2.1. A user first constructs a profile by entering favorite artist names and then obtains a ranked list of recommended artists, highlighted in a latent mood space visualization, cf. Figure 1(i). The centroid of the profile artists’ positions is used to recommend nearby artists. Changes in a user’s preference are modeled interactively by moving through this space, and the resulting trail is used to recommend artists.
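The centroid-based step described above can be sketched in a few lines, assuming artist coordinates in the latent mood space are already given (a simplified reading of the described mechanism, not the authors’ code):

```python
# Sketch: recommend the artists closest to the centroid of the user's profile
# artists in the latent mood space.
import numpy as np

def recommend_by_centroid(mood_coords: np.ndarray, profile_idx: list, n: int = 10):
    """mood_coords: (n_artists, d) latent mood positions; profile_idx: indices of profile artists."""
    centroid = mood_coords[profile_idx].mean(axis=0)
    dists = np.linalg.norm(mood_coords - centroid, axis=1)
    dists[profile_idx] = np.inf          # exclude the profile artists themselves
    return np.argsort(dists)[:n]         # indices of recommended artists
```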

In MoodSwings (), users try to match each other while tracing the trajectory of music through a 2D emotion space. The users’ input provides metadata on the emotional impression of songs as it changes over time.

More recently, studies have focused on the design of user-centric recommender interfaces to account for individual preferences and control of the recommendation process. Jin et al. () investigate the impact of different control elements that let users adapt recommendations, while aiming to prevent cognitive overload. One finding is that users with a high musical sophistication index () not only appreciate greater control over recommendations but also perceive the adapted recommendations to be of higher quality, leading to higher acceptance. The impact of personal characteristics on preferences for visual control elements is further investigated by Millecamp et al. (). Again, participants with a high musical sophistication index, as well as Spotify power users, showed a strong preference for control via a radar chart over traditional sliders for adapting recommendation parameters for music discovery, cf. Figure 6. Kamehkhosh et al. () investigate the implications of recommender techniques for the discovery of music during playlist building. They find that recommendations displayed in visual playlist-building tools are actively incorporated by users and even influence the choices made during playlist creation when the recommendations are not directly incorporated.

Figure 6 

Personalized control over recommendations ().

Overall, these interfaces and studies about interfaces show a clear trend towards personalization and user-centric development, integrating aspects of personality and affect (cf. ). This observation is further supported by works dealing with psychologically-inspired music recommendation as described next.

4.2 Psychologically-inspired music recommendation

Recently, music recommender research has been experiencing a boost in topics related to psychology-informed recommendation. In particular, the psychological concepts of personality and affect (mood and emotion) are increasingly integrated into prototypes. The motivation for this is that both personality traits and affective states while listening to music have been shown to strongly influence music preferences (; ; ).

Lu and Tintarev () propose a system that re-ranks the results of a collaborative filtering approach according to the degree of diversity each song contributes to the recommendation list. Since previous studies showed that personality is most strongly correlated with musical key, genre, and number of artists, the authors implement diversity through these features and adjust results depending on the listener’s personality. Fernández-Tobías et al. () propose a personality-aware matrix factorization approach that integrates a latent user factor describing users’ personality in terms of the Big Five/OCEAN model with the five factors openness, conscientiousness, extraversion, agreeableness, and neuroticism (). Deng et al. () propose an emotion-aware recommender for which they extract music listening information and emotions from posts on Sina Weibo, a popular Chinese microblogging service, adopting a lexicon-based approach (Chinese dictionaries and emoticons). FocusMusicRecommender by Yakura et al. () recommends and plays back musical pieces suited to the user’s current concentration level, estimated from the user’s behavior history.

4.3 Summary of Phase 3

The still ongoing third phase of music discovery interfaces is driven by machine learning methods to predict the “right music” at the “right time” for each user. To this end, user profiles consisting of previous interactions, as well as potentially any other source of information on the user, such as context data or personality features, are exploited.

Current commercial platforms and their interfaces are designed to cover a variety of use cases by providing applications with different foci. As different usage scenarios and user intents require different types of recommendation strategies, the user is given the choice of which focus best suits the current situation by selecting among these applications. For instance, discovery of new tracks (e.g. as in Spotify’s Release Radar) requires a different strategy than rediscovery of known tracks (e.g. as in Daily Mixes), and a personalized radio station for Workout will have different selection criteria than a radio station for Chill. In addition, platforms integrate many functions of traditional terrestrial radio stations as well, including the promotion of artists, and therefore also provide manually curated discovery, e.g. by means of non-personalized radio stations or playlists. Hence, music discovery interfaces have moved away from a one-size-fits-all approach towards a suite of applications catering to different listening needs and access paradigms.

5. The Next Phase: The Future of Intelligent Music User Interfaces

Just as technological developments have enabled and shaped the nature of music access in the past — from audio compression to always-online mobile devices — the future will be no different in this regard.

One direction that has already been taken is the streaming of music via so-called smart speakers like Amazon Echo, Google Home, or Apple HomePod, controlled via voice through personal assistants like Alexa, Google Assistant, or Siri, respectively (). For music recommendation, this poses new challenges, from recognizing non-standard and ambiguously pronounced terms like artist names in spoken language to context- and intention-aware disambiguation of utterances, e.g. to identify the intended version of a song.

In terms of recommendation approaches, this signifies a renaissance of knowledge-based recommender systems () and an increasing integration of music knowledge graphs (), enabling conversational interaction and techniques like “critiquing”, an iterative process of evaluating and modifying recommendations based on the characteristics of items (), as well as a need for story generation techniques (). An example showcasing some of these techniques is the music recommender chatbot MusicBot by Jin et al. (). MusicBot features user-initiated and system-suggested critiquing, which have a positive impact on user engagement as well as on diversity in discovery. MusicRoBot by Zhou et al. () is another conversational music recommender built upon a music knowledge graph.

As a result, the predominant notion of a music discovery interface as a graphical user interface might lose relevance as interaction moves to a different modality. In this setting, the trends towards context-awareness and personalization, also on the level of individual personality traits, gain even more importance. This amplifies the already central challenge of accurately inferring a user’s intent from an action (listening, skipping, etc.), i.e., of uncovering the reasons why humans indulge in music, from the comparatively limited signal that is received (; ).

On the other hand, we see developments in the realm of music generation and variation algorithms. These algorithms create new musical content by learning from large repositories of examples, cf. recent work by Google Magenta (, ; ) and OpenAI, and/or with the help of informed rules and templates, e.g., in automatic video soundtrack creation or adaptive video game music generation. An important development in this research direction is again to give the user agency in the process of creation (“co-creation”). For instance, a personalization approach to melody generation is taken in MidiMe by Dinculescu et al. (), cf. Figure 7(c). Cococo by Louie et al. () is a controlled music creation tool for the completion of compositions, giving the user high-level control of the generative process.

Figure 7 

Interfaces highlighting the confluence of music listening and music (co-)creation.

In the long run, we expect the borders between these domains to blur, i.e., there will be no difference between accessing existing, recorded music and music automatically created by the system tailored to the listener’s needs. More concretely, as discussed as one of the grand challenges in MIR by Goto (), we envision music streaming systems that deliver preferred content based on the user’s current state and situational context, automatically change existing music content to fit the context of the user, e.g., by varying instruments, arrangements, or tempo of the track, and even create new music based on the given setting. One of the earliest approaches to customizing or personalizing existing music is “music touch-up” by Goto (). Further examples are Drumix by Yoshii et al. () and AutoMashUpper by Davies et al. (), cf. Figure 7(a). Lamere’s Infinite Jukebox can also be seen as an example in this direction, cf. Figure 7(b).

With the current knowledge of streaming platforms about a user’s preferences, context-sensing devices running the music apps, and first algorithms to vary and generate content, the necessary ingredients for such a development seem to be available already. These developments, along with the increasing interest in the role of Artificial Intelligence (AI) in the arts in general, will have a larger impact than just a technological one, raising legal questions regarding ownership and intellectual property () as well as questions about the perception and value of art, especially AI-created art (). Research in these areas therefore needs to consider a variety of stakeholders.

6. Conclusions and Discussion

We identified three phases of listening culture and discussed corresponding intelligent interfaces. Interfaces pertaining to the first phase focus on structuring and visualizing smaller-scale music collections, such as personal collections or early digital sales repositories. In terms of research prototypes, this phase is mostly driven by content-based MIR algorithms. The second phase deals with web-based interfaces and information systems, with a strong focus on textual descriptions in the form of collaborative tags. MIR research during this phase therefore deals with automatic tagging of music and the utilization of tag information in interfaces. Finally, the third and current phase is shaped by lean-back experiences driven by automatic playlist algorithms and personalized recommendation systems. MIR research is therefore shifting towards the exploitation of user interaction data, albeit always with a focus on integrating content-based methods, community metadata, user information, and contextual information about the user. While the former three strategies are typically applied to remedy cold-start problems, the integration of context-awareness often amplifies them.

The overview given in this paper focuses on academic interfaces over the past 20 years; however, it is interesting to observe that today’s most successful commercial platforms bear little resemblance to the prototypes discussed. Instead, traditional list or “spreadsheet” views showing the classic metadata fields title, artist, album, and track length still seem to constitute the state of the art in displaying music throughout most applications. This discrepancy between academic work and commercial services mostly affects the interfaces, as the underlying methods for content feature extraction, metadata integration, and recommendation can all be found in similar forms in existing systems. This raises the question of whether academic interfaces do not meet users’ desiderata for a music application or whether commercial interfaces are missing out on beneficial components.

Lehtiniemi and Holm () have investigated different types of music discovery interfaces and summarized user comments regarding desired features for an “ultimate” music player: “a streaming music service with a large music collection and a mobile client; support for all three modes of music discovery (explorative, active and passive); easy means for finding new music (e.g. textual search, ‘get similar’ type of functionality and mood-based music search); music recommendations with surprising and unexpected results; links to artist videos, biography and other related information; storing, editing and fine-tuning playlists; adapting to user’s own musical taste; support for social networking services; contextual awareness; and customizable and aesthetic look.”

We can see that commercial interfaces tick many boxes from this list, but we can also see how the discussed interfaces from all three phases relate to these aspects and have left their footprints in current systems. While map-based interfaces from Phase 1 see no adoption in current commercial systems, the concepts of similarity-based retrieval, playlist generation, and sequential play are still key elements. From Phase 2, facets of music information systems, such as biographical and related data, can be found in active exploration scenarios, for instance when focusing on the discovery of the work of a specific artist. The aspect of personalization in Phase 3, which is also the basis for serendipitous results in recommendations, is the central feature of current systems. The trends towards context-awareness and adaptive interfaces are ongoing.

As integration of all these requirements is far from trivial and beyond the scope of typical research prototypes, new developments make increasing use of existing and familiar interface elements, e.g. by including or mimicking user interface elements from Spotify (; ; ). Nonetheless, research prototypes will continue to fall short of providing the full music platform experience. A notable exception and example of a comprehensive application originating from research, which is successfully adopted outside of lab conditions, is Songrium by Hamasaki et al. (), which integrates several levels of discovery functions and active music-listening interfaces into a joint application.

To sum up, the evolution of music discovery interfaces has led to the current situation of access to virtually all music catalogs by means of streaming services. On top of that, these services provide a suite of applications catering to different listening needs and situations. The trend of personalizing listening experiences leads us to believe that, in the not too distant future, music listening will not only be a matter of delivering the right music at the right time, but also of generating and “shaping” the right music for the situation the user is in. We will therefore see a confluence of music retrieval and (interactive) music generation. Beyond this, the topics of explainable recommendations and control over recommendations are gaining importance. Given these perspectives, research in MIR and intelligent user interfaces for music discovery and listening will undoubtedly remain an exciting field to work in.