
Proc. of the 10th Int. Conference on Digital Audio Effects (DAFx-07), Bordeaux, France, September 10-15, 2007

A MATLAB TOOLBOX FOR MUSICAL FEATURE EXTRACTION FROM AUDIO

Olivier Lartillot, Petri Toiviainen

University of Jyväskylä, Finland

lartillo@campus.jyu.fi

Abstract: We present MIRtoolbox, an integrated set of functions written in Matlab, dedicated to the extraction of musical features from audio files. The design is based on a modular framework: the different algorithms are decomposed into stages, formalized using a minimal set of elementary mechanisms, and integrate the variants proposed by alternative approaches (including new strategies we have developed), which users can select and parametrize.

This paper offers an overview of the set of features, related, among others, to timbre, tonality, rhythm or form, that can be extracted with MIRtoolbox. Four particular analyses are provided as examples. The toolbox also includes functions for statistical analysis, segmentation and clustering. Particular attention has been paid to the design of a syntax that offers both simplicity of use and transparent adaptiveness to a multiplicity of possible input types. Each feature extraction method can accept as argument an audio file, or any preliminary result from intermediary stages of the chain of operations. The same syntax can also be used for analyses of single audio files, batches of files, series of audio segments, multichannel signals, etc. For that purpose, the data and methods of the toolbox are organised in an object-oriented architecture.

1. MOTIVATION AND APPROACH

MIRtoolbox is a Matlab toolbox dedicated to the extraction of musically related features from audio recordings. It has been designed in particular with the objective of enabling the computation of a large range of features from databases of audio files, which can then be subjected to statistical analyses.

Few software packages have been proposed in this area. The most important one, Marsyas [1], provides a general architecture for connecting audio, sound files, signal processing blocks and machine learning (see section 4 for more details). One particularity of our own approach lies in the use of the Matlab computing environment, which offers good visualisation capabilities and gives access to a large variety of other toolboxes. In particular, MIRtoolbox makes use of functions available in recommended public-domain toolboxes such as the Auditory Toolbox [2], NetLab [3], or SOM toolbox [4]. Other toolboxes, such as the Statistics toolbox or the Neural Network toolbox from MathWorks, can be directly used for further analyses of the features extracted by MIRtoolbox without having to export the data from one software package to another.

Such a computational framework, because of its general objectives, could be useful to the research community in Music Information Retrieval (MIR), but also for educational purposes. For that reason, particular attention has been paid to the ease of use of the toolbox. In particular, complex analytic processes can be designed using a very simple syntax, whose expressive power comes from the use of an object-oriented paradigm.

The different musical features extracted from the audio files are highly interdependent: in particular, as can be seen in figure 1, some features are based on the same initial computations. In order to improve the computational efficiency, it is important to avoid redundant computations of these common components. Each of these intermediary components, and the final musical features, are therefore considered as building blocks that can be freely articulated with one another. Besides, in keeping with the objective of optimal ease of use of the toolbox, each building block has been conceived in such a way that it can adapt to the type of input data.

For instance, the computation of the MFCCs can be based on the waveform of the initial audio signal, or on intermediary representations such as the spectrum or the Mel-scale spectrum (see Fig. 1). Similarly, autocorrelation is computed for different ranges of delays depending on the type of input data (audio waveform, envelope, spectrum). This decomposition of the whole set of feature extraction algorithms into a common set of building blocks has the advantage of offering a synthetic overview of the different approaches studied in this domain of research.

2. FEATURE EXTRACTION

2.1. Feature overview

Figure 1 shows an overview of the main features implemented in the toolbox. All the different processes start from the audio signal (on the left) and form a chain of operations proceeding to the right. The vertical disposition of the processes indicates an increasing order of complexity of the operations, from the simplest computations (top) to more detailed auditory modelling (bottom).

Each musical feature is related to one of the musical dimensions traditionally defined in music theory. Boldface characters highlight features related to pitch, to tonality (chromagram, key strength and key Self-Organising Map, or SOM) and to dynamics (Root Mean Square, or RMS, energy). Bold italics indicate features related to rhythm, namely tempo, pulse clarity and fluctuation. Simple italics highlight a large set of features that can be associated with timbre. Among them, all the operators in grey italics can in fact be applied to many other representations: for instance, statistical moments such as centroid, kurtosis, etc., can be applied to spectra and envelopes, but also to histograms based on any given feature.

One of the simplest features, zero-crossing rate, is based on a simple description of the audio waveform itself: it counts the number of sign changes of the waveform. Signal energy is computed using root mean square, or RMS [5]. The envelope of the audio signal offers timbral characteristics of isolated sonic events.

The FFT-based spectrum can be computed along the frequency domain or along Mel-bands, with a linear or decibel energy scale, and applying various windowing methods. The results can be multiplied with diverse resonance curves in order to highlight different aspects such as metrical pulsation (when computing the FFT of envelopes) or fluctuation [6].
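As an illustration of the two simplest descriptors mentioned above, the following sketch computes a zero-crossing rate and an RMS energy in plain Matlab, without MIRtoolbox. The test signal, its sampling rate and the choice of expressing the rate per second are assumptions made for this example only.

```matlab
% Synthetic test signal: 1 second of a 100 Hz sine sampled at 8 kHz
% (both values are arbitrary choices made for this illustration).
fs = 8000;                        % sampling rate in Hz
t  = (0:fs-1)' / fs;              % time axis in seconds
x  = sin(2*pi*100*t + 0.1);       % small phase offset keeps samples away from zero

% Zero-crossing rate: number of sign changes, expressed here per second.
signs     = sign(x);
crossings = sum(signs(1:end-1) .* signs(2:end) < 0);
zcr       = crossings / (numel(x) / fs);

% Signal energy as root mean square (RMS).
rmsEnergy = sqrt(mean(x.^2));

fprintf('zero-crossing rate: %g per second\n', zcr);
fprintf('RMS energy: %.4f\n', rmsEnergy);
```

A pure tone crosses zero twice per period, so the rate is roughly twice the frequency; noisier or brighter signals give higher values.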

Many features can be derived from the FFT:

• Basic statistics of the spectrum give some timbral characteristics (such as spectral centroid, roll-off [5], brightness, flatness, etc.).

• The temporal derivative of the spectrum gives the spectral flux.

• An estimation of roughness, or sensory dissonance, can be obtained by adding the beating provoked by each pair of energy peaks in the spectrum [7].

• A conversion of the spectrum to a Mel-scale can lead to the computation of Mel-Frequency Cepstral Coefficients (MFCC) (cf. example 2.2), and of fluctuation [6].

• Tonality can also be estimated (cf. example 2.3).
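Two of the spectral statistics listed above, centroid and roll-off, can be sketched in plain Matlab as follows. The test signal is synthetic, and the 85 % energy ratio used for the roll-off is a common convention assumed here; it is not taken from this paper.

```matlab
% Synthetic signal: two sinusoids at 440 Hz and 1760 Hz (arbitrary choices).
fs = 8000;
N  = 8000;                               % one second of signal
t  = (0:N-1)' / fs;
x  = sin(2*pi*440*t) + 0.5*sin(2*pi*1760*t);

% Magnitude spectrum restricted to non-negative frequencies.
X    = abs(fft(x));
mag  = X(1:N/2+1);
freq = (0:N/2)' * fs / N;                % frequency of each bin in Hz

% Spectral centroid: magnitude-weighted mean frequency.
centroid = sum(freq .* mag) / sum(mag);

% Roll-off: lowest frequency below which 85 % of the spectral energy lies
% (the 85 % ratio is an assumed convention).
energy  = cumsum(mag.^2);
rolloff = freq(find(energy >= 0.85 * energy(end), 1));

fprintf('centroid: %.1f Hz, roll-off: %.1f Hz\n', centroid, rolloff);
```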

The computation of the autocorrelation can use diverse normalization strategies, and integrates the improvement proposed by Boersma [8] in order to compensate for the side-effects due to the windowing. Resonance curves are also available here. Autocorrelation can be generalized through a compression of the spectral representation [9]. The estimation of pitch is usually based on spectrum, autocorrelation, or cepstrum, or a mixture of these strategies [10].

A distinct approach consists of designing a complete chain of processes based on the modelling of auditory perception of sound and music [2] (circled in Figure 1). This approach can be used in particular for the computation of rhythmic pulsation (cf. example 2.3).

2.2. Example: Timbre analysis

One common way of describing timbre is based on MFCCs [11, 2]. Figure 2 shows the diagram of operations. First, the audio sequence is loaded (1) and decomposed into successive frames (2), which are then converted into the spectral domain using the mirspectrum function (3). The spectra are converted from the frequency domain to the Mel-scale domain (4): the frequencies are rearranged into 40 frequency bands called Mel-bands. The envelope of the Mel-scale spectrum is described with the MFCCs, which are obtained by applying the Discrete Cosine Transform to the Mel-scale spectrum. Usually only a restricted number of them (for instance the first 13) are selected (5).

a=miraudio('audiofile.wav')   % (1)
f=mirframe(a)   % (2)
s=mirspectrum(f)   % (3)
m=mirspectrum(s,'Mel')   % (4)
c=mirmfcc(s,'Rank',1:13)   % (5)

The computation can be carried out in a window sliding through the audio signal (this corresponds to code line 2), resulting in a series of MFCC vectors, one for each successive frame, that can be represented column-wise in a matrix. Figure 2 shows an example of such a matrix. The MFCCs do not convey a very intuitive meaning per se, but are generally applied to distance computation between frames, and therefore to segmentation tasks (cf. section 2.4).
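The last step of this chain, the Discrete Cosine Transform of the Mel-scale spectrum, can be sketched in plain Matlab. The band magnitudes below are invented, the logarithmic compression before the transform is the usual MFCC convention and is assumed here, and the DCT-II basis is written out by hand so that no additional toolbox is needed. This is a sketch of the principle, not the code of mirmfcc.

```matlab
% Hypothetical Mel-band magnitudes: 40 bands (rows) by 3 frames (columns).
% Real values would come from a Mel-scale spectrum; these are synthetic.
nBands = 40;
k      = (0:nBands-1)';
bands  = [exp(-k/10), exp(-k/20), ones(nBands,1)];

% Logarithmic compression (assumed convention before the cepstral transform).
logBands = log(bands);

% Unnormalised DCT-II basis written out explicitly:
% basis(r+1, b+1) = cos(pi * r * (b + 0.5) / nBands)
r     = (0:nBands-1)';
basis = cos(pi * r * (k' + 0.5) / nBands);

% One column of cepstral coefficients per frame; keep the first 13 ranks.
cepstrum = basis * logBands;
mfccLike = cepstrum(1:13, :);

disp(size(mfccLike));
```

Each column of mfccLike plays the role of one MFCC vector in the matrix representation described above; a flat band spectrum (third column) yields coefficients that are all zero.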

The whole process can be executed in one single line by calling directly the mirmfcc function with the audio input as argument:

mirmfcc(f,'Rank',1:13)   % (6)

2.3. Example: Rhythm analysis

One common way of estimating the rhythmic pulsation, described in figure 6, is based on auditory modelling [5]. The audio signal is first decomposed into auditory channels using a bank of filters. Diverse types of filterbanks are proposed, and the number of channels can be changed, for instance to 20 (8). The envelope of each channel is extracted (9). As pulsation is generally related to increases of energy only, the envelopes are differentiated and half-wave rectified, before being finally summed together again (10). This gives a precise description of the variation of energy produced by each note event from the different auditory channels.

After this onset detection, the periodicity is estimated through autocorrelation (12). However, if the tempo varies throughout the piece, an autocorrelation of the whole sequence will not show clear periodicities. In such cases it is better to compute the autocorrelation for a frame decomposition (11). This yields a periodogram that highlights the different periodicities, as shown in figure 6. In order to focus on the periodicities that are more perceptible, the periodogram is filtered using a resonance curve [16] (12), after which the best tempos are estimated through peak picking (13), and the results are converted into beats per minute (14). Due to the difficulty of choosing among the possible multiples of the tempo, several candidates (three for instance) may be selected for each frame, and a histogram of all the candidates for all the frames, called periodicity histogram, can be drawn (15).

fb=mirfilterbank(a,20)   % (8)
e=mirenvelope(fb,'Diff','Halfwave')   % (9)
s=mirsum(e)   % (10)
fr=mirframe(s,3,1)   % (11)
ac=mirautocor(fr,'Resonance')   % (12)
p=mirpeaks(ac,'Total',1,'NoEnd')   % (13)
t=mirtempo(p)   % (14)
h=mirhisto(t)   % (15)

The whole process can be executed in one single line by calling the mirtempo function directly with the audio input as argument:

mirtempo(a,'Frame')   % (16)

In this case, the different options available throughout the process can be specified directly as arguments of the tempo function. For instance, a frame-based tempo estimation, with a selection of the 3 best tempo candidates in each frame, a range of admissible tempi between 60 and 120 beats per minute, and an estimation strategy based on a mixture of spectrum and autocorrelation applied to the spectral flux, will be executed with the syntax:

mirtempo(a,'Frame','Total',3,...
         'Min',60,'Max',120,'Spectrum',...
         'Autocor','SpectralFlux')   % (17)
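The core of the periodicity estimation described in this example (differentiation, half-wave rectification, autocorrelation, peak picking, conversion to beats per minute) can be sketched in plain Matlab on a synthetic envelope. The envelope sampling rate, the 60 to 180 BPM search range and the use of a simple global maximum are assumptions of this sketch; the filterbank and the resonance curve are left out.

```matlab
% Synthetic envelope sampled at 100 Hz with one energy burst every 0.5 s.
fsEnv = 100;                              % envelope sampling rate in Hz (assumed)
dur   = 10;                               % duration in seconds
env   = zeros(dur*fsEnv, 1);
env(1:50:end) = 1;

% Differentiate and half-wave rectify to keep increases of energy only.
onset = max(diff([0; env]), 0);

% Autocorrelation for lags corresponding to tempi between 60 and 180 BPM.
minLag = round(fsEnv * 60 / 180);         % shortest period, in samples
maxLag = round(fsEnv * 60 / 60);          % longest period, in samples
lags   = (minLag:maxLag)';
ac     = zeros(size(lags));
for i = 1:numel(lags)
    L     = lags(i);
    ac(i) = sum(onset(1:end-L) .* onset(1+L:end));
end

% Peak picking (here simply the global maximum), then conversion to BPM.
[~, best] = max(ac);
tempoBPM  = 60 * fsEnv / lags(best);

fprintf('estimated tempo: %g BPM\n', tempoBPM);
```

The autocorrelation also shows a peak at twice the period (60 BPM), which illustrates why several tempo candidates may be worth keeping per frame.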

2.4. Segmentation

More elaborate tools have also been implemented that can carry out higher-level analyses and transformations. In particular, audio files can be automatically segmented into a series of homogeneous sections, through the estimation of temporal discontinuities along diverse alternative features, such as timbre in particular [17]. First the audio signal is decomposed into frames (18) and one chosen feature, such as MFCC (19), is computed along these frames. The feature-based distances between all possible frame pairs are stored in a similarity matrix (20). Convolution along the main diagonal of the similarity matrix using a Gaussian checkerboard kernel yields a novelty curve that indicates the temporal locations of significant textural changes (21). Peak detection applied to the novelty curve returns the temporal positions of feature discontinuities (22), which can be used for the actual segmentation of the audio sequence (23).
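The similarity matrix and the novelty curve can be sketched in plain Matlab on a toy feature matrix. The feature values, the exponential mapping from distance to similarity, and the kernel half-width and Gaussian width are all illustrative assumptions; the sketch follows the general checkerboard-kernel idea of [17] rather than the toolbox code.

```matlab
% Synthetic feature matrix: one column per frame, with an abrupt change
% between frame 20 and frame 21 (stand-in for real MFCC vectors).
feat = [zeros(2,20), ones(2,20)];
n    = size(feat, 2);

% Similarity matrix derived from pairwise Euclidean distances.
sim = zeros(n);
for i = 1:n
    for j = 1:n
        sim(i,j) = exp(-norm(feat(:,i) - feat(:,j)));
    end
end

% Gaussian checkerboard kernel of half-width h (illustrative values).
h      = 4;
sigma  = h / 2;
u      = (-h:h-1) + 0.5;                 % offsets around the kernel centre
[a, b] = meshgrid(u, u);
kernel = sign(a) .* sign(b) .* exp(-(a.^2 + b.^2) / (2 * sigma^2));

% Slide the kernel along the main diagonal to obtain the novelty curve.
novelty = zeros(1, n);
for c = h+1 : n-h+1
    idx        = c-h : c+h-1;
    novelty(c) = sum(sum(kernel .* sim(idx, idx)));
end

[~, boundary] = max(novelty);            % first frame of the new section
fprintf('strongest change at frame %d\n', boundary);
```

Inside a homogeneous section the positive and negative quadrants of the kernel cancel out, so the curve stays near zero and only rises where two dissimilar blocks meet.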

fr=mirframe(a)   % (18)
fe=mirmfcc(fr)   % (19)
sm=mirsimatrix(fe)   % (20)
nv=mirnovelty(sm)   % (21)
ps=mirpeaks(nv)   % (22)
sg=mirsegment(a,ps)   % (23)

The whole segmentation process can be executed in one single line by calling the mirsegment function directly with the audio input as argument:

mirsegment(a,'Novelty')   % (25)

By default, the novelty curve is based on MFCC, but other features can be selected as well using an additional option:

mirsegment(a,'Novelty','Spectrum')   % (26)

A second similarity matrix can be computed, in order to show the distance (according to the same feature as the one used for the segmentation) between all possible segment pairs (28).

fesg=mirmfcc(sg)   % (27)
smsg=mirsimatrix(fesg)   % (28)

2.5. Data analysis

The toolbox includes diverse tools for data analysis, such as a peak extractor, and functions that compute histograms, entropy, zero-crossing rates, irregularity or various statistical moments (centroid, spread, skewness, kurtosis, flatness) on data of various types, such as spectrum, envelope or histogram.
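The statistical moments listed above can be computed on any non-negative curve by treating it as a distribution. The plain-Matlab sketch below uses invented values and the standard textbook definitions of the moments; the variable names are chosen for this example and do not refer to toolbox functions.

```matlab
% Illustrative distribution: magnitudes w observed at positions pos
% (for a spectrum, pos would hold the bin frequencies). Values are made up.
pos = 1:5;
w   = [1 2 3 2 1];
p   = w / sum(w);                        % normalise to unit sum

centroid = sum(pos .* p);                               % first moment (mean)
spread   = sqrt(sum((pos - centroid).^2 .* p));         % standard deviation
skew     = sum((pos - centroid).^3 .* p) / spread^3;    % asymmetry
kurt     = sum((pos - centroid).^4 .* p) / spread^4;    % peakedness

fprintf('centroid %.2f, spread %.3f, skewness %.2f, kurtosis %.2f\n', ...
        centroid, spread, skew, kurt);
```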

The mirpeaks function can accept any data returned by any other function of the MIRtoolbox and can adapt to the different kinds of data of any number of dimensions. In the graphical representation of the results, the peaks are automatically located on the corresponding curves (for 1D data) or bit-map images (for 2D data).

The mirpeaks function offers alternative heuristics. It is possible to define a global threshold that peaks must exceed in order to be selected. We have designed a new strategy of peak selection, based on a notion of contrast, discarding peaks that are not sufficiently contrastive (based on a certain threshold) with the neighbouring peaks. This adaptive filtering strategy hence adapts to the local particularities of the curves. Its combination with other, more conventional thresholding strategies leads to an efficient peak picking module that can be applied throughout the MIRtoolbox.
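One possible reading of the combination of a global threshold with a contrast criterion is sketched below in plain Matlab. The exact rule used by mirpeaks is not specified in this paper, so the definition of contrast adopted here (the drop from a peak to the higher of its two neighbouring valleys) and both threshold values are assumptions.

```matlab
% Illustrative curve with two prominent peaks and one shallow bump.
y         = [0 1 5 1 0.9 1.2 0.9 0 4 0];
globalThr = 1.0;      % peaks must exceed this absolute height (assumed value)
contrast  = 1.0;      % minimal drop to the surrounding valleys (assumed value)

% Candidate peaks: interior samples strictly higher than both neighbours.
cand = find(y(2:end-1) > y(1:end-2) & y(2:end-1) > y(3:end)) + 1;

keep = false(size(cand));
for i = 1:numel(cand)
    p = cand(i);
    % Search range on each side: up to the neighbouring candidate,
    % or up to the boundary of the curve.
    if i == 1, lo = 1; else, lo = cand(i-1); end
    if i == numel(cand), hi = numel(y); else, hi = cand(i+1); end
    leftValley  = min(y(lo:p));
    rightValley = min(y(p:hi));
    % Keep the peak if it exceeds the global threshold and stands out
    % from the higher of its two valleys by at least the contrast.
    keep(i) = y(p) > globalThr && ...
              (y(p) - max(leftValley, rightValley)) >= contrast;
end
peaks = cand(keep);

disp(peaks);
```

The bump at position 6 passes the global threshold but is discarded because it rises only 0.3 above its surrounding valley, whereas a global threshold alone would have kept it.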

Supervised classification of musical samples can also be performed, using techniques such as K-Nearest Neighbours or Gaussian Mixture Models. One possible application is the classification of audio recordings into musical genres.
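The K-Nearest Neighbours principle can be illustrated with a few lines of plain Matlab. The feature values and class labels below are synthetic, and the sketch does not reflect how the toolbox implements classification.

```matlab
% Toy training set: one column per excerpt, two features per excerpt.
% Feature values and class labels are synthetic, for illustration only.
trainX = [0.1 0.2 0.15 0.8 0.9 0.85;
          0.2 0.1 0.25 0.9 0.8 0.75];
trainY = [1 1 1 2 2 2];                 % class labels
query  = [0.7; 0.8];                    % excerpt to classify
k      = 3;

% Euclidean distance from the query to every training excerpt.
d = sqrt(sum((trainX - repmat(query, 1, size(trainX, 2))).^2, 1));

% Majority vote among the k nearest neighbours.
[~, order] = sort(d);
predicted  = mode(trainY(order(1:k)));

fprintf('predicted class: %d\n', predicted);
```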

3. DESIGN OF THE TOOLBOX

3.1. Data encapsulation

All the data returned by the functions in the toolbox are encapsulated into typed objects. The default display method associated with all these objects is a graphical display of the corresponding curves. In this way, when the display of the values of a given analysis is requested, what is printed is not a listing of long vectors or matrices, but rather a correctly formatted graphical representation.

The actual data matrices associated with those data can be obtained by calling a method called mirgetdata, which constructs the simplest possible data structure associated with the data (cf. paragraph 4.1).

3.2. Frame analysis

Frame-based analyses (i.e., based on the use of a sliding window) can be specified using two alternative methods. The first method is based on the use of the mirframe function, which decomposes an audio signal into successive frames. Optional arguments can specify the frame size (in seconds, by default) and the hop factor (between 0 and 1, by default). For instance, in the following code (line 29), the frames have a size of 50 milliseconds and are half overlapped. The results of that function can then be sent directly as input to any other function of the toolbox (30):

f=mirframe(a,.05,.5)   % (29)
mirtempo(f)   % (30)

Yet this first method does not work correctly, for instance, when dealing with tempo estimation as described in section 2.3. Following this first method, as shown in figure 7, the frame decomposition is the first step performed in the chain of processes. As a result, the input of the filterbank decomposition is a series of short frames, which induces two main difficulties. Firstly, in order to avoid the presence of an undesirable transitory state at the beginning of each filtered frame, the initial state of each filter would need to be tuned depending on the state of the filter at one particular instant of the previous frame (depending on the overlapping factor). Secondly, the multiplication of the redundancies of the frame decomposition (if the frames are overlapped) throughout the multiple channels of the filterbank would require a considerable amount of memory. The technical difficulties and waste of memory induced by this first method can be immediately overcome if the frame decomposition is performed after the filterbank decomposition and recomposition, as shown in figure 8.

This second method, more successful in this context, cannot be managed using the previous syntax, as the input of the mirtempo function should not be frame-decomposed yet. The alternative syntax consists in proposing the frame decomposition option as a possible argument ('Frame') of the mirtempo function (31). This corresponds to what was presented in section 2.3 (code lines 16 and 17).

mirtempo(a,'Frame',.05,.5)   % (31)
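The meaning of the two frame parameters used in this section, a frame size of 0.05 s and a hop factor of 0.5, can be illustrated in plain Matlab. The sampling rate and the dummy signal are arbitrary, and the hop is taken as hop factor times frame length, which is consistent with the "half overlapped" description; this is a sketch, not the code of mirframe.

```matlab
fs       = 1000;                         % sampling rate in Hz (arbitrary)
x        = (1:200)';                     % 0.2 s of dummy samples
frameLen = round(0.05 * fs);             % 50 ms frames
hop      = round(0.5 * frameLen);        % hop factor 0.5, i.e. half overlap

% Store the successive frames as the columns of a matrix.
nFrames = floor((numel(x) - frameLen) / hop) + 1;
frames  = zeros(frameLen, nFrames);
for k = 1:nFrames
    start        = (k-1) * hop;
    frames(:, k) = x(start + (1:frameLen));
end

% A vector-based operation generalises to all frames at once:
frameRMS = sqrt(mean(frames.^2, 1));     % one RMS value per column (frame)

disp(size(frames));
```

Storing frames as columns means that most vector operations (mean, sum, fft) apply to every frame in a single call, since Matlab operates column-wise by default.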


The frame decomposition option is available as a possible argument to most of the functions of the toolbox. Each function can then specify the exact position of the frame decomposition within its chain of operations. Besides, if not specified, the default parameters of the frame decomposition (i.e., frame size and hop factor) can be adapted to each specific function. Hence, from a user's point of view, the execution and chaining of the different operators of the MIRtoolbox follow the same syntax, be there frame decomposition or not, apart from the additional use of either the command mirframe or the option 'Frame' for frame decomposition. Of course, from a developer's point of view, this requires that each feature extraction algorithm should adapt to frame-decomposed input. More precisely, as will be explained in section 4.1, input can be either a single vector or a matrix, where columns represent the successive frames. Conveniently enough, in the Matlab environment, the generalization of vector-based algorithms to matrix-based versions is generally effortless.

3.3. Adaptive syntax

As explained previously, the diverse functions of the toolbox can accept alternative input:

• The name of a particular audio file (either in wav or au format) can be directly specified as input:

mirspectrum('myfile')   % (32)

• The audio file can be first loaded using the miraudio function, which can perform diverse operations such as resampling, automated trimming of the silence at the beginning and/or at the end of the sequence, extraction of a given subsequence, centering, normalization with respect to RMS energy, etc.

a=miraudio('myfile','Sampling',11025,...
           'Trim','Extract',2,3,...
           'Center','Normal')   % (33)
mirspectrum(a)   % (34)

• Batch analyses of audio files can be carried out by simply replacing the name of the audio file with the keyword 'Folder'.

mirspectrum('Folder')   % (35)

• Any vector v computed in Matlab can be converted into a waveform using, once again, the miraudio function, by specifying a sampling rate.

a=miraudio(v,44100)   % (36)
mirspectrum(a)   % (37)

• Any feature extraction can be based on the result of a previous computation. For instance, the autocorrelation of a spectrum curve can be computed as follows:

s=mirspectrum(a)   % (38)
as=mirautocor(s)   % (39)

• Products of curves [10] can be performed easily:

mirautocor(a)*mirautocor(s)   % (40)

In this particular example, the waveform autocorrelation mirautocor(a) is automatically converted to the frequency domain in order to be combined with the spectrum autocorrelation mirautocor(s).

4. MIRTOOLBOX COMPARISON TO MARSYAS

Marsyas is a framework written in C++ and Java for prototyping and experimentation with computer audition applications [1]. It provides a general architecture for connecting audio, sound files, signal processing blocks and machine learning. The architecture is based on dataflow programming, where computation is expressed as a network of processing nodes/components connected by a number of communication channels/arcs. Users can build their own dataflow network using a scripting language at run-time. Marsyas provides a framework for building applications rather than a set of applications [1]. Marsyas executables operate either on individual sound files or on collections, which are simple text files that contain lists of sound files. In general, collection files should contain sound files with the same sampling rate, as Marsyas does not perform automatic sampling conversion (except between 44100 Hz and 22050 Hz). The results of feature extraction processes are stored by Marsyas as text files that can be used later in the Weka machine learning environment. In parallel, Marsyas integrates some basic machine learning components.

MIRtoolbox also offers the possibility of articulating processes one after the other in order to construct complex computations, using a simple and adaptive syntax. Contrary to Marsyas, though, MIRtoolbox does not offer real-time capabilities. On the other hand, its object-based architecture (paragraph 4.2) enables a significant simplification of the syntax. MIRtoolbox can also analyse folders of audio files, and can deal with folders of varying sampling rates without having to perform any conversion. The data computed by the MIRtoolbox can be further processed directly in the Matlab environment with the help of other toolboxes, or can be exported into text files.

5. AVAILABILITY OF THE MIRTOOLBOX

Following our first Matlab toolbox, called MIDItoolbox [18], dedicated to the analysis of symbolic representations of music, the MIRtoolbox is offered for free to the research community. It can be downloaded from the following URL:

http://www.cc.jyu.fi/~lartillo/mirtoolbox

6. ACKNOWLEDGMENTS

This work has been supported by the European Commission (NEST project “Tuning the Brain for Music”, code 028570). The development of the toolbox has benefitted from productive collaborations with the other partners of the project, in particular Tuomas Eerola, Jose Fornari, Marco Fabiani, and students of our department.

7. REFERENCES

[1] G. Tzanetakis and P. Cook, “Marsyas: A framework for audio analysis,” Organised Sound, vol. 4, no. 3, 2000.

[2] M. Slaney, “Auditory toolbox version 2,” Tech. Rep. 1998-010, Interval Research Corporation, 1998.

[3] I. Nabney, “NETLAB: Algorithms for pattern recognition,” Springer Advances in Pattern Recognition Series, 2002.

[4] J. Vesanto, “Self-Organizing Map in Matlab: the SOM Toolbox,” in Proceedings of the Matlab DSP Conference, 1999, pp. 35–40.

[5] G. Tzanetakis and P. Cook, “Multifeature audio segmentation for browsing and annotation,” in Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1999.

[6] E. Pampalk, A. Rauber, and D. Merkl, “Content-based organization and visualization of music archives,” in Proceedings of the 10th ACM International Conference on Multimedia, 2002, pp. 570–579.

[7] E. Terhardt, “On the perception of periodic sound fluctuations (roughness),” Acustica, vol. 30, no. 4, pp. 201–213, 1974.

[8] P. Boersma, “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound,” IFA Proceedings, vol. 17, pp. 97–110, 1993.

[9] T. Tolonen and M. Karjalainen, “A computationally efficient multipitch analysis model,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708–716, 2000.

[10] G. Peeters, “Music pitch representation by periodicity measures based on combined temporal and spectral representations,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2006.

[11] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.

[12] E. Gomez, “Tonal description of polyphonic audio for music content processing,” INFORMS Journal on Computing, vol. 18, no. 3, pp. 294–304, 2006.

[13] C. Krumhansl, Cognitive Foundations of Musical Pitch, Oxford University Press, 1990.

[14] C. Krumhansl and E. J. Kessler, “Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys,” Psychological Review, vol. 89, pp. 334–368, 1982.

[15] P. Toiviainen and C. Krumhansl, “Measuring and modeling real-time responses to music: The dynamics of tonality induction,” Perception, vol. 32, no. 6, pp. 741–766, 2003.

[16] P. Toiviainen and J. S. Snyder, “Tapping to Bach: Resonance-based modeling of pulse,” Music Perception, vol. 21, no. 1, pp. 43–80, 2003.

[17] J. Foote and M. Cooper, “Media segmentation using self-similarity decomposition,” in Proceedings of SPIE Storage and Retrieval for Multimedia Databases, 2003, number 5021, pp. 167–175.

[18] T. Eerola and P. Toiviainen, “MIR in Matlab: The Midi Toolbox,” in Proceedings of 5th International Conference on Music Information Retrieval, 2004, pp. 22–27.

[19] P. N. Juslin, “Emotional communication in music performance: A functionalist perspective and some data,” Music Perception, vol. 14, pp. 383–418, 1997.

[20] K. R. Scherer and J. S. Oshinsky, “Cue utilization in emotion attribution from auditory stimuli,” Motivation and Emotion, vol. 1

