Quality as an image-specific characteristic perceived by an average human observer

Non-reference image quality measures. Blur as an important factor in its perception. Determination of the intensity of each segment. Research design, data collecting, image markup. Linear regression with known target variable. Comparing feature weights.

Category: Programming, computers and cybernetics
Type: diploma thesis
Language: English
Date added: 23.12.2015

Posted on http://www.stud.wiki/


Ranking images by quality is a common challenge in many areas of applied science and technology. For example, a set of images returned by a web search may have good search relevance, but if the relevant images are also the best in quality, the users' impression will improve. Another area is medicine, where patient examinations produce terabytes of visual data that are hard to analyze all at once. Preprocessing this data by extracting the best images for further diagnostics would therefore save physicians time. Finally, if we know what defines image quality numerically, we can start developing quality-enhancing filters that make imaging data more appealing to the human visual system.

Image quality is a complex concept that can be interpreted in different ways. In our work we consider quality as an image-specific characteristic perceived by an average human observer. Thus, an image of good quality corresponds to our general idea of an image that is regular, informative, and well presented. Presently, it is common to measure image quality with a single metric such as contrast or blurriness.

Our goal is to provide a more comprehensive and formal definition of human quality perception by identifying the top factors responsible for visual quality. To reduce subjectivity, we treat quality as an objective, non-reference, multidimensional measure that can be computed for an image independently, without comparing it to other images. Our practical goal is to find a restricted set of features that are most responsible for quality perception. Such a set would be a first step toward a practical tool that improves the quality of medical images when they are displayed.

1. Non-reference image quality measures

Most published research on image quality uses quality measures estimated for an original image and its distorted copies []. In this study we use so-called non-reference measures, where quality is estimated for a single image independently. We use a number of previously developed measures together with basic measures such as contrast, as described below.

1.1 Blurriness measures

Even a partly blurred image affects the human perception of quality. That is why we consider blurriness an important factor in image quality perception. In this work we use two different blurriness measures.

The first one, described by F. Crete and T. Dolmiere [1], uses a low-pass filter and is based on the principle that the gray levels of neighboring pixels vary more in a sharp image than in its blurred copy. The authors therefore compute the absolute vertical and horizontal differences D between neighboring pixels in the original and blurred images (1):

(Eq. 1a)

(Eq. 1b)

where I(x,y) is the intensity value at pixel (x,y), and h and w are the height and width of the image. Next, the variation of neighboring pixels before and after blurring is analyzed: if the variation is high, the original image is considered sufficiently sharp. To evaluate the variation, we consider only the differences that decreased, and obtain the variation V for the vertical and horizontal directions (2):

(Eq. 2)

where DB_ver(x,y) is the absolute difference for blurred image B.

Finally, blurriness for vertical direction is computed:

(Eq. 3)

Horizontal blurriness is computed in the same way. The maximum of the two is selected as the final blurriness measure: Fblur = max(Fblur_hor, Fblur_ver). Below we refer to it as Fblur_1.
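The steps behind (1)-(3) can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in the study: the image is a grayscale NumPy array, the low-pass filter is assumed to be a 9-pixel averaging filter, and the function names `box_blur_1d` and `blur_crete` are hypothetical.

```python
import numpy as np

def box_blur_1d(img, axis, size=9):
    # Assumed low-pass filter: moving average of `size` pixels along one axis.
    kernel = np.ones(size) / size
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, img)

def blur_crete(img, size=9):
    img = np.asarray(img, dtype=float)
    scores = []
    for axis in (0, 1):  # 0: vertical direction, 1: horizontal direction
        blurred = box_blur_1d(img, axis, size)
        d_orig = np.abs(np.diff(img, axis=axis))      # differences in the original
        d_blur = np.abs(np.diff(blurred, axis=axis))  # differences after blurring
        variation = np.maximum(0.0, d_orig - d_blur)  # keep only decreased differences
        total = d_orig.sum()
        # Share of neighbor variation that survives blurring; 1.0 for a flat image.
        scores.append((total - variation.sum()) / total if total > 0 else 1.0)
    return max(scores)  # final measure: maximum over the two directions
```

The returned value lies in [0, 1]; an already blurred image loses little variation when blurred again, so its score is closer to 1.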

Another blurriness measure, presented by Min Goo Choi [2], is based on edge extraction using the intensity gradient. The authors define the horizontal and vertical absolute difference of a pixel as the difference between its left and right, or upper and lower, neighboring pixels. Then they obtain the mean horizontal and vertical absolute differences for the entire image, such as Dhor_mean in (Eq. 4).

(Eq. 4)

Then the absolute horizontal difference of each pixel is compared with the mean value computed for the whole image to select edge candidates Chor(x,y):

(Eq. 5)

If a candidate pixel Chor(x,y) has an absolute horizontal difference larger than those of its horizontal neighbors, it is classified as an edge pixel Ehor(x,y), as shown in (6).

(Eq. 6)

Each edge pixel is examined to find whether it corresponds to a blurred edge or not. First, horizontal blurriness of a pixel is computed according to (7).

(Eq. 7)

The vertical value is obtained in the same way, and the maximum of the two is used for the final decision. A pixel is considered blurred if its value is larger than a predefined threshold (0.1 is suggested in the paper).

(Eq. 8)

Finally, the resulting blurriness measure for the whole image is called inverse blurriness and is computed as the ratio of the number of blurred edge pixels to the number of edge pixels (9).

(Eq. 9)

Below we refer to this measure as Fblur_2 to distinguish it from the blur measure described in [1].

We assume that an increase in blurriness should negatively affect quality perception, because a very blurred image loses important information and is less attractive.

1.2 Image entropy

The basic idea behind entropy is to measure the uncertainty of the image. The more information and the less noise an image contains, the more useful it is, and we might relate image usefulness to its objective quality. In our study Shannon entropy (the formula is taken from Wikipedia) was computed for the entire image, its foreground, and its background according to (Eq. 10).

(Eq. 10)

where p(Ik) is the probability of the particular intensity value Ik.

We assume that higher entropy means that the image contains more signal. For example, if there are fewer details and more plain surfaces, entropy is lower. However, a noisy image also has higher entropy, so we consider entropy on three levels of the image.
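As an illustration, the entropy of the whole image, its background and its foreground can be sketched as follows. This is a minimal sketch that assumes 8-bit integer intensities, a base-2 logarithm (the source does not state the base) and the mean-intensity split into background and foreground described in Section 1.3; the function names are hypothetical.

```python
import numpy as np

def shannon_entropy(pixels, levels=256):
    """Shannon entropy (in bits) of integer intensities in [0, levels)."""
    values = np.asarray(pixels).ravel().astype(np.int64)
    if values.size == 0:
        return 0.0
    counts = np.bincount(values, minlength=levels)
    p = counts[counts > 0] / values.size  # p(I_k) for intensities that occur
    return float(-(p * np.log2(p)).sum())

def entropy_features(img):
    img = np.asarray(img)
    # Assumed split: pixels below the mean intensity form the background.
    foreground = img >= img.mean()
    return {
        "whole": shannon_entropy(img),
        "background": shannon_entropy(img[~foreground]),
        "foreground": shannon_entropy(img[foreground]),
    }
```

A constant image has zero entropy, while an image that uses two intensity values equally often has an entropy of one bit.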

1.3 Segmentation

This measure, presented in [3], shows how well the segments of an image can be separated. We use the simplest yet most intuitive implementation, comparing two major segments: image background (seg1) and foreground (seg2). In this study we simply computed the average intensity value and classified all pixels with lower intensity as background and the remaining pixels as foreground. To compute the segmentation measure, the average difference U between neighboring pixels in a 3x3 sliding window is computed for each image segment (Eq. 11):

(Eq. 11)

leading to the following measure W:

(Eq. 12)

Then we compute the average pixel intensity in each segment and obtain the squared difference between the average intensities of every pair of segments; in our case there is only one pair. The inverse of the sum of squared differences of average intensities is called B:

(Eq. 13)

Resulting measure is obtained as:

Fsep = 1000*W+B (Eq. 14)

and it will be high for images with high separability between segments and low separability within each segment. In our case this measure is meaningful only for the set of images depicting trees, because the other set, the medical images, mostly shows a dark background that is clearly separated from the foreground.

1.4 Flatness

This measure is described in [4] and uses two-dimensional discrete Fourier transform of the image. First, we obtain 2D Discrete Fourier Transform of the image, which is transformed to one-dimensional vector FV. Next, spectral flatness SF is computed as ratio of geometric to arithmetic mean:

(Eq. 15)

The resulting measure proposed in the paper is called entropy power and is obtained as a product of spectral flatness measure SF presented in (15) and image variance as shown:

(Eq. 16)

where the variance is computed around the average intensity value of the image. This measure is assumed to be higher for less informative, non-predictive and redundant images.
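The two steps above can be sketched with NumPy. This is a hedged sketch: it assumes that the spectral flatness is taken over the squared DFT magnitudes (the source does not say whether magnitudes or powers are used) and adds a small constant `eps` to avoid taking the logarithm of zero; the function names are hypothetical.

```python
import numpy as np

def spectral_flatness(img, eps=1e-12):
    # Flatten the 2D DFT into a vector of (assumed) power values.
    spectrum = np.fft.fft2(np.asarray(img, dtype=float))
    power = np.abs(spectrum).ravel() ** 2 + eps
    geometric_mean = np.exp(np.mean(np.log(power)))
    return float(geometric_mean / np.mean(power))  # always in (0, 1]

def entropy_power(img):
    img = np.asarray(img, dtype=float)
    # Product of spectral flatness and image variance.
    return spectral_flatness(img) * float(img.var())
```

White noise has a nearly flat spectrum and therefore a flatness close to 1, while a single sinusoid concentrates its energy in a few coefficients and yields a flatness close to 0.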

1.5 Sharpness

This measure [5] is based on assumption that differences of neighboring pixels change more in the areas with sharp edges. Therefore the authors compute second-order difference for the neighboring pixels as a discrete analog of second derivative for the image passed through denoising median filter:

(Eq. 17)

where Im is the original image passed through a median filter.

Then the authors define the vertical sharpness Sver of each pixel as shown below:

(Eq. 18)

and each pixel is treated as sharp if its sharpness exceeds 0.0001. The number of sharp pixels NSver is computed, and the edge pixels are found with the Canny method, NEver being their count. The same process is then repeated in the horizontal direction, and the ratio of sharp pixels to edge pixels for the vertical and horizontal directions is computed as:

(Eq. 19)

We assume that a sharper image should be perceived as more attractive and informative.

1.6 Blockness measure

This measure evaluates the image in terms of block artifacts [6]. Absolute intensity differences between neighboring pixels are obtained for the vertical and horizontal directions as shown in (1), and each element of the resulting matrix is then normalized:

(Eq. 20)

By taking the average for each column of matrix we obtain the horizontal profile of image Phor as shown in (Eq.21):

(Eq. 21)

The vertical profile is obtained in the same way, and a 1-D DFT is applied to both profiles. The magnitude M of the DFT coefficients is then considered:

(Eq. 22)

where 0 ≤ T ≤ w-2.

The horizontal blockness measure Bl for block size Z is computed as shown in (Eq. 23). Due to the nature of the DFT, Mhor(T) has peaks at particular values of T indexed by b = 1, 2, …, Z. The values of Mhor(T) at these peak points correspond to the horizontal blockness of the image, Blhor:

(Eq. 23)

The vertical blockness measure is obtained similarly. In our study block widths of 2, 4, 6 and 8 pixels were used. The resulting measure is shown in (24):

(Eq. 24)

where r and 1-r are the weights of the horizontal and vertical measures. We use r equal to 0.5. This measure is higher for images distorted by block artifacts.

1.7 Fractal dimension

The idea of a possible relation between image quality and the amount of image detail brings us to measures of fractal dimension. We detect the main contours in the image using the Canny method and then estimate the fractal dimension of the obtained curve. We use box counting to compute the dimension (Eq. 25), where N stands for the number of square blocks with side ε, for ε = 2, 3, 4, and 5.

(Eq. 25)

We assume that a higher fractal dimension corresponds to a more detailed and therefore more informative image.
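Box counting can be sketched as follows. The sketch assumes that the contours are already available as a binary mask (for example, the output of a Canny detector computed elsewhere) and that the dimension is estimated as the slope of log N against log(1/ε), which is a common choice rather than something stated in the source; the function names are hypothetical.

```python
import numpy as np

def box_count(mask, size):
    """Number of size-by-size boxes that contain at least one contour pixel."""
    height, width = mask.shape
    count = 0
    for row in range(0, height, size):
        for col in range(0, width, size):
            if mask[row:row + size, col:col + size].any():
                count += 1
    return count

def box_counting_dimension(mask, sizes=(2, 3, 4, 5)):
    mask = np.asarray(mask, dtype=bool)
    counts = [box_count(mask, s) for s in sizes]
    # Assumed estimator: least-squares slope of log N(eps) versus log(1/eps).
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return float(slope)
```

A completely filled mask gives a dimension of 2 and a straight line gives a dimension of 1, so detailed contours are expected to fall between these values.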

1.8 Noise level

It is natural to assume that the presence of noise can be detrimental to perceived image quality. Therefore we included a noise measure developed by Masayuki T. [7]. In that work, the noise level is described as the standard deviation of Gaussian noise. The authors propose a patch-based algorithm. First, the original image is decomposed into overlapping patches, and the model for the whole image is written as pi = zi+ni, where zi is the original image patch with the i-th pixel in its center, transformed to a one-dimensional vector, pi is the observed patch (also transformed to a vector), and ni is the vector of Gaussian noise that distorts it. To estimate the noise level we need to obtain the unknown standard deviation using only the observed noisy image.

All image patches are treated as data in Euclidean space, and the variance of the data can be projected onto a single axis whose direction is defined by a vector u. The variance V of the data projected onto u can be written as:

(Eq. 26)

where σ is the standard deviation of the Gaussian noise.

The direction of minimum variance is then found using Principal Component Analysis (PCA). First, the data covariance matrix Σ is defined as:

(Eq. 27)

where b is the number of patches and m is the average of the dataset {pi}. The variance of the data projected onto the minimum-variance direction equals the minimum eigenvalue of the covariance matrix:

(Eq. 28)

where Σz is the covariance matrix of the noise-free patches z.

The noise level could be estimated by decomposing the minimum eigenvalue of the covariance matrix of the noisy patches, but this is an ill-posed problem because the minimum eigenvalue of the covariance matrix of the noiseless patches is unknown. The authors therefore suggest selecting weakly textured patches from the noisy image: such patches span a low-dimensional space, so the minimum eigenvalue of their covariance matrix is close to zero, and the noise level Fnoise can be estimated as:

(Eq. 29)

where Σ' is the covariance matrix of the weakly textured patches.

Undoubtedly, the most important part of the proposed algorithm is the selection of weakly textured patches. The main idea is to compare the maximum eigenvalue of the gradient covariance matrix of a patch with a threshold. The gradient covariance matrix C of patch j is computed as:

(Eq. 30)

where Gj = [Dhorj, Dverj] and Dhor and Dver are horizontal and vertical derivative operators.

To select a weakly textured patch, a statistical hypothesis is tested. The null hypothesis (the patch has a weak, flat texture) is accepted if the maximum eigenvalue of its gradient covariance matrix Cj is less than a threshold. The threshold τ for the maximum eigenvalue of the gradient covariance matrix can be found as:

(Eq. 31)

where the significance level is set to 0.99 and the inverse-gamma cumulative distribution function has shape parameter b/2 and a scale parameter. The inverse-gamma cumulative distribution function is defined as:

(Eq. 32)

where Г(.) denotes the gamma function and the remaining parameters are the scale and shape parameters. The gamma function for a positive integer n is defined as:

(Eq. 33)

We assume that noisier images would have worse quality and would be less informative.
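The PCA step of this estimator can be illustrated with a simplified sketch. It deliberately omits the selection of weakly textured patches and uses all overlapping patches, so it is only reliable for images with little texture; the 5x5 patch size and the function name `estimate_noise_sigma` are assumptions made for illustration.

```python
import numpy as np

def estimate_noise_sigma(img, patch=5):
    img = np.asarray(img, dtype=float)
    # All overlapping patch-by-patch windows, each flattened to a vector p_i.
    windows = np.lib.stride_tricks.sliding_window_view(img, (patch, patch))
    patches = windows.reshape(-1, patch * patch)
    # Sample covariance matrix of the patch vectors.
    covariance = np.cov(patches, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; take the smallest one.
    min_eigenvalue = np.linalg.eigvalsh(covariance)[0]
    # Simplification: the noise variance is taken as the minimum eigenvalue.
    return float(np.sqrt(max(min_eigenvalue, 0.0)))
```

On a flat image corrupted by Gaussian noise the estimate is close to the true standard deviation; on textured images it overestimates the noise, which is the reason the original method restricts the computation to weakly textured patches.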

1.9 Average gradient and edge intensity

Both measures are taken from [8]. Average gradient FAG shows how pixel values change on average for vertical and horizontal directions according to:

(Eq. 34)
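Since the exact formula is not reproduced here, the following sketch uses a commonly used definition of average gradient: the mean over the image of the root of half the sum of squared vertical and horizontal forward differences. Both this definition and the function name `average_gradient` are assumptions and may differ in detail from [8].

```python
import numpy as np

def average_gradient(img):
    img = np.asarray(img, dtype=float)
    # Forward differences on the common (h-1) x (w-1) grid.
    d_ver = img[1:, :-1] - img[:-1, :-1]
    d_hor = img[:-1, 1:] - img[:-1, :-1]
    return float(np.mean(np.sqrt((d_ver ** 2 + d_hor ** 2) / 2.0)))
```

For a constant image the value is zero, and for a horizontal ramp that grows by one gray level per pixel it equals 1/sqrt(2).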

Edge intensity FEI is computed as:

(Eq. 35)

where Gver and Ghor are vertical and horizontal gradients obtained as:

(Eq. 36)

(Eq. 37)

Finally, we use a number of simple image quality metrics. First of all, average intensity FAI is computed as:

(Eq. 38)

Overall image contrast FC and contrast per pixel FCPP are obtained as:

(Eq. 39)

(Eq. 40)

Table 1. Correspondence between the described measures and the names of features in our dataset. The trailing digits 0, 1 and 2 denote the three levels of the Laplacian pyramid

Measure                       Corresponding variables
No-reference blur metric      Blur10, blur11, blur12
Min Goo Choi method           Blur20, blur21, blur22
Shannon entropy               Ent10, ent11, ent12
Local Shannon entropy         EntB0, entF0
Separability measure          Sep0, sep1, sep2
Flatness measure              Flat0, Flat1, Flat2
Sharpness measure             Sharp0, sharp1, sharp2
Overall image contrast        Contr20, contr21, contr22
Blockness measure             Block20, block40, block60, block80, block21 etc
Fractal dimension             Frac0, frac1, frac2
Average intensity             Intens0, intens1, intens2
Noise level                   Noise0, noise1, noise2
CPP - contrast per pixel      Contr10, contr11, contr12
Average gradient              AG0, AG1, AG2
Edge intensity                EI0, EI1, EI2

2. Research design, data collecting and image markup

To evaluate the performance of various quality measures and validate the results, we used two datasets of grayscale images of different nature and quality. The quality of each image was assessed twice: first by human observers (thus capturing our visual perception of image quality), and second by the set of metrics described above. The metrics were applied to the original images as well as to their lower-resolution copies, derived with Laplacian pyramid decomposition, which produced a total of 57 quality metric measurements per image. Our main intention was to find the best sets of numerical metrics that would explain the observed human perception of image quality.

Each image dataset used in this work consisted of similar images: the first set had 50 medical images (computed tomography of the abdomen), and the second had 50 scenery photographs of trees and forest landscapes. We intentionally chose images of a rather abstract and “emotion-free” nature to exclude subjective bias in human perception.

The human perception ranks for the images were obtained with pairwise comparisons between all images in each dataset. The images were presented in random pairs to 15 human observers, who were asked to choose the better of the two. This task was implemented using Amazon Mechanical Turk; Figure 1 shows a screenshot of the assignment.

To ensure robust comparisons, we used markup with triple overlap: each pair of images was compared three times by different observers, and the final choice was made by majority rule. As a result, more than 7000 pairs were presented and compared.

To obtain image features, 19 basic quality measures were computed for three copies of each image: the original image and its two lower-resolution copies derived as two levels of the Laplacian pyramid. The resulting 57 measurements were treated as 57-dimensional image feature vectors and used as independent variables in the models.

Figure 1. Mechanical Turk assignment for image markup

3. Experimental Results

3.1 Linear regression with known target variable

In the first step of the research we try to solve our task using known quality measures for every image. In this approach we fit models to predict a known outcome.

Based on the pairwise image comparison results, we computed a quality index for every image as the number of the image's wins divided by the number of comparisons it took part in. This allowed us to put the images in a linear quality order. Note that in general this linear order cannot agree with all the recorded comparisons: in some instances an image with a higher quality index was perceived as inferior when compared with some lower-quality image. This non-linearity in image grades originated from differences in quality perception between human observers, and we call such image pairs inverted. Overall, 10% of pairs were inverted in the medical dataset and 14% in the trees dataset.
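The quality index and the share of inverted pairs can be illustrated with a short sketch. It assumes that the comparisons are stored as (winner, loser) tuples after the majority vote and that pairs with equal indices are not counted as inverted; the function names are hypothetical.

```python
from collections import defaultdict

def quality_indices(comparisons):
    """Fraction of wins per image from a list of (winner, loser) pairs."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {image: wins[image] / games[image] for image in games}

def inverted_share(comparisons, quality):
    """Share of comparisons won by the image with the lower quality index."""
    inverted = sum(1 for winner, loser in comparisons
                   if quality[winner] < quality[loser])
    return inverted / len(comparisons)
```

For example, with the comparisons a>b, b>c, a>c and c>a, image a gets index 2/3, b gets 1/2 and c gets 1/3, and the single comparison won by c over a is inverted, giving a share of 0.25.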

Using the linear quality indices (rankings) as the target variable, we implemented linear regression with the L2 norm as the basic model. We considered all possible regression models containing combinations of k features, k = 1…57, and extracted the best models for each k as those with the least regression error. Note that this amounts to an exhaustive search through millions of possible models (feature combinations), so we used a branch-and-bound algorithm to speed up the search.

Normalized regression error for L2 regression, E, was defined as:

(Eq. 41)

where Wp stands for the model-predicted image quality and W for the real observed quality.
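The best-subset search can be illustrated with a brute-force sketch. It replaces the branch-and-bound algorithm with a plain enumeration of all feature combinations of size k, which is feasible only for small k, and it ranks models by the residual sum of squares of an ordinary least-squares fit with an intercept rather than by the normalized error E; the function names are hypothetical.

```python
import itertools
import numpy as np

def fit_rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return float(residual @ residual)

def r_squared(X, y):
    total = float(((y - y.mean()) ** 2).sum())
    return 1.0 - fit_rss(X, y) / total

def best_subset(X, y, k):
    """Indices of the k features with the smallest residual sum of squares."""
    return min(itertools.combinations(range(X.shape[1]), k),
               key=lambda cols: fit_rss(X[:, list(cols)], y))
```

With 57 features the number of five-feature combinations already exceeds four million, which is why a pruning strategy such as branch-and-bound is needed in practice.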

One of the main goals of the study was to find a set of factors responsible for the human perception of image quality. We validated our feature-modeling results on the medical (MS) and trees (TS) image datasets separately to check that models that perform well on one dataset are also good for the other.

Figure 2 shows various models with 1, 2, 3, 4 and 5 features. We evaluated each model with R squared, the fraction of the original data variation explained by the model. Treating image quality as a function of our visual perception rather than of image selection, we assumed that a good model should perform well on both the MS and TS datasets. As one can see, R squared does not increase dramatically after using more than 6 features, so we show only the models with up to 5 predictors. Circle sizes correspond to the average error of each model. The largest circles correspond to errors close to 0.27, while the best models have errors close to 0.08.

Figure 2 Regression models for both datasets

One can also observe that the circles on the plot tend to cluster along the diagonal line, which means that most models perform similarly on the MS and TS datasets. Moreover, the higher k (the number of model features/predictors) is, the closer the circles are to the diagonal line. As a result, higher k generally corresponds to more accurate and more image-independent models, which can provide optimal quality predictions for both the MS and TS sets.

Figures 3 a, b illustrate best models obtained for MS and TS independently. As the figure indicates, the models selected as the best for one dataset perform well on the other. This already can be viewed as a strong demonstration of the objectivity in the human image quality perception: despite the obvious differences between the images of CT scans and forest landscapes, the models optimal for one set were among the best performers for the other.

Figure 3 a, b

Finally, Figure 4 shows the top ten models for each model size, sorted by the average mean error on the two datasets. Most models lie on the diagonal line, and models with 4, 5 and 6 features lie increasingly close to each other due to high R squared for both datasets.

Figure 4 Best ten models for both sets

Table 2 summarizes the best predictors selected for each number of features defined in Table 1. It provides some significant insights. First of all, there is a limited set of quality measures that occur in most of the optimal models derived for the MS and TS data. It can be assumed that these factors play the most important role in our perception of image quality:

· Entropy power of the image on the first and second levels of the Laplacian pyramid (metrics flat0, flat1). It is the product of spectral flatness and image variance and shows the compressibility of the image signal, reflecting how much useful signal the image contains.

· Entropy of the background (entB0, entB1) and entropy of the whole image are present in many optimal models for both sets.

· Blockness measures for all block sizes (2, 4, 6 and 8 pixels) are important for all sets of images on all three levels of the pyramid.

· Both blur measures, sharpness, contrast and edge intensity on all resolution levels are significant for all datasets, indicating that the perception of contrast and blurriness is one of the major components of image quality.

· Fractal dimension on all levels of image resolution can be found in models for both sets.

· Average gradient is especially important for the trees dataset. This measure shows how much pixel values change on average. According to it, images with higher-contrast edges between objects get a higher mark.

· Object separability on the first and second levels of the pyramid can be found in models for both sets. This measure is higher for images with distinguishable, higher-contrast parts.

As a result, we identify the following major factors responsible for the human perception of image quality:

· The amount of information contained in the image, which can be described by spectral flatness and entropy measures. It is remarkable that random noise is not taken into account, while larger objects have some impact.

· Contrast, average gradient and blurriness are the most important non-reference quality measures affecting visual perception of the whole image, while sharpness and noise level hardly appear in the best models. This might be explained by the sensitivity of the metrics used.

· Artifact measures like blockness appear to be significant in most models.

· Background entropy performs well only as an add-on factor that explains the variance not already covered by the other factors.

All things considered, we obtained models containing restricted sets of features that are able to explain quality perception. However, the basic matrix of comparisons is our ground truth and main source of information. To measure the quality of the described approach, we compared each pair of images by the predicted quality measures computed by the best five-feature models mentioned above. To get the vector of predicted values we performed leave-one-out cross-validation for each of the two sets, which gave a more stable vector of quality measures. On each step one image was held out, the model weights were learned on the remaining images, and the quality measure of the held-out image was predicted. The final vector of model quality measures was constructed from the predicted values and normalized.
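The leave-one-out procedure can be sketched as follows. The sketch assumes an ordinary least-squares model with an intercept and leaves out the final normalization, because the normalization used in the study is not specified; the function name `loo_predictions` is hypothetical.

```python
import numpy as np

def loo_predictions(X, y):
    """Predict each image's quality from a model fitted on all other images."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    predictions = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # hold out image i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        predictions[i] = A[i] @ coef
    return predictions
```

The resulting vector can then be compared pair by pair against the comparison matrix to count inverted pairs.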

The average share of inverted pairs computed for the predicted quality measures against the initial comparison matrix is 31% for medical images and 29% for trees. This is considerably higher than the 10% and 14% obtained for the original quality indices, so the result could be improved.

Table 2 Best predictor values for models with restricted sets of factors. The table contains the best three models according to average error on the two datasets

Model size N

Best L2 predictors for both datasets

Best L2 predictors for trees dataset

Best L2 predictors for medical dataset

1

· Blur10, Blur12
· Sep0
· Blur20, blur22
· Intens2
· EntF0
· AG1
· Sharp1
· EI1, EI0
· Ent10
· ent11
· sep0

2

· Blur20, sep0
· Blur20, sep1
· EntF1, blur20
· Blur20/21, intens0/1/2
· Blur20, entF0
· EntB0, block60
· EntB0, frac0
· Blur20, sep0
· Blur20, sep1
· Blur20, intens0

3

· Blur20, EntB0, sharp1
· Blur10, Blur11, blur22
· Block measures + blur
· Blur20, EntB0, frac0
· Blur20, entB0, frac2
· entB0, sep0, flat2
· Blur10, blur11, blur22
· Contr20, blur21, noise2
· Contr20, intens0, ent11
· Blur20, block22, block62

4

· Blur20 + blockness measures
· Blur11, entB1, intens1, block22
· Contr22, noise2, blur21, entB0
· entB0, sep0, block80, flat2
· blur10, entB0, sep0, flat2
· block62, blur20, contr10, block22
· blur20, contr20, block62, block22

5, 6

· entB0, blur21, flat1, EI1, frac2
· entB0, blur21, flat1, EI1, block62
· blur10, entB0, sep0, block40, flat2
· blur10, entB0, sep0, block80, flat2
· Blur20, block60, block62, block22
· Blur20, EI0, EI1, block22, block42

3.2 Checking linearity of image quality perception

As mentioned before, reducing the pairwise comparison scores to one-dimensional linear quality indices resulted in 10%-14% of inverted pairs, that is, instances where the linear image quality values disagree with the result of the pairwise comparison. Using OLS regression models with five features resulted in 29-31% of inverted pairs.

To improve our results and to account for more arbitrary ways of defining image quality indices, we decided to consider a scenario with no predefined quality order. The basic idea was to treat the quality measures as unknown variables and to find their optimal values satisfying two major criteria: good predictability with linear regression and the lowest number of inverted image pairs.

There is another issue: in the previous part we used a linear model of quality, but linear dependence is not obvious and should be checked. To do this, we used a simple method based on the best models obtained in the previous step. The idea was to keep the linear models while increasing R-squared, minimizing the error and not increasing the number of inverted pairs. We used the known quality measures from the previous step as starting values. If the linear model is appropriate, then we should be able to adjust the target vector to get a higher R-squared without violating the restrictions of the initial matrix of comparisons.

We started by looking for the set of measures with the lowest regression error that does not increase the number of inversions with respect to the initial pairwise comparison matrix. In addition, we tried to decrease the number of inverted pairs with the new set of measures.

To check this we implemented the simple algorithm described below.


Initialisation: the starting set q1…qn, where qi is the fraction of wins for the i-th image in pairwise comparisons.

Iterative process:

WHILE the terminal condition is not met, for each qi, 1 ≤ i ≤ n:

1. Get the interval for qi_new: [qi_min, qi_max], where

qi_min = max of all qj, qj<qi, i≠j;

qi_max = min of all qk, qk>qi, k≠i.

If qi_min > qi_max:

take the sorted array [q1, q2, …, qm, qm+1, …, qn];

for each interval [qm, qm+1] set qt = (qm + qm+1)/2;

choose qt = argmin (Ninverted_pairs);

set qi_min = qt and qi_max = qt+1.

If qt provides no more inverted pairs: qi_new = qi.

If qi_min < qi_max: go to step 2.

2. For each qi in [qi_min, qi_max] with a step of 0.1 * the length of the interval:

find the optimal qi = argmin (MSE) for the linear regression model.

Repeat steps 1 and 2 until R-squared exceeds a threshold and the difference in squared error between steps s and s-1 is less than a threshold. To compare the error on step s with the previous step s-1, fit the feature weights using the vector Qs as a target, obtain the model vector Qs_mod, and compute the errors of Qs-1 and Qs against this vector.

We assumed that in the case of a nonlinear dependence between quality and features this algorithm would not converge: the idea of the algorithm is to move the initial quality measures closer to the model line. If this is possible without violating the restrictions of the comparison matrix, then the mean square error (MSE) decreases because the model line fits the new quality measures better.

We used the best ten five-feature models and the quality measures from the previous section as initial values. However, in all cases it was impossible to decrease the fraction of inverted pairs by more than 2 percentage points. We suggest that this is caused by peculiarities of human perception and a lack of transitivity in pairwise comparisons: a person who compares images two at a time cannot keep all previously seen images in mind and provide an ideal linear order of them.

We increased R-squared by up to 0.2 without violating the conditions, although we hardly ever achieved an R-squared above 0.8. This result still supports the view that a linear model is adequate for explaining quality perception. Figures 5 a, b show the average new and old values of the quality measures obtained for the best models for MS and TS respectively. The Pearson correlation between old and new values is around 0.8, which indicates that the new values remain close to a linear function of the initial vector. This result enabled us to use linear models in the next step, where the quality measures are treated as unknown.

Figures 5 a b. Old and new values of quality for MS(a) and TS(b)

3.3 Computing quality measures of images using Elo ratings approach

To improve the initial assignment of the quality indices, we tried one more approach, which does not use any initial target vector of quality measures and is based only on the initial comparison matrix.

This approach is based on the Elo rating system for chess tournaments [9]. Each pair is considered an independent Bernoulli trial in which the outcome (image A winning over image B) has probability p. All comparisons are seen as a series of such trials. Each image in a pair has a rating that determines the outcome of the comparison, so that the image with the higher rating is expected to win. The rating of image K is a linear combination of its L features with weights:

(Eq. 42)

Probability of choosing image A in pairwise comparison i or, in other words, probability of image A rating being larger than image B rating is written as logistic function:

(Eq. 43)

The optimal set of feature weights provides ratings that make the observed pairwise comparisons most likely. The outcome x of each comparison can be 0 or 1 and follows the Bernoulli formula, where the probability P is the one shown in (Eq. 43):

, x = {0,1} (Eq. 44)

Likelihood function is written as:

(Eq. 45)

To obtain image rankings that give the most likely pairwise comparisons according to the initial matrix, we iteratively change the feature weights to maximize the logarithm of the likelihood, which is the sum of the logarithms of Pi(x) shown in (Eq. 44). The optimization was conducted using a gradient descent method from the SciPy library.
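The rating model can be illustrated with a small sketch. It assumes that the probability of image A winning is the logistic function of the rating difference, stores comparisons as (winner, loser) index pairs, and uses a plain fixed-step gradient ascent written in NumPy instead of the SciPy optimizer used in the study; the function names, learning rate and number of steps are hypothetical.

```python
import numpy as np

def fit_pairwise_logistic(features, pairs, lr=0.1, steps=2000):
    """Fit feature weights by maximizing the log-likelihood of the outcomes."""
    features = np.asarray(features, dtype=float)
    winners = np.array([w for w, _ in pairs])
    losers = np.array([l for _, l in pairs])
    # Rating difference R_A - R_B equals (f_A - f_B) . weights.
    diff = features[winners] - features[losers]
    weights = np.zeros(features.shape[1])
    for _ in range(steps):
        p_win = 1.0 / (1.0 + np.exp(-(diff @ weights)))  # P(winner beats loser)
        gradient = diff.T @ (1.0 - p_win) / len(pairs)   # d(mean log-likelihood)/dw
        weights += lr * gradient
    return weights

def correct_rate(features, pairs, weights):
    """Share of comparisons in which the higher-rated image is the winner."""
    ratings = np.asarray(features, dtype=float) @ weights
    return float(np.mean([ratings[w] > ratings[l] for w, l in pairs]))
```

The rate returned by `correct_rate` corresponds to the ratio of correct pairwise comparisons used to compare models.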

This method was applied to various combinations of five features, taken from the features used in the previous method, independently on each image set, in order to compare the features and estimate their importance for image quality perception. In addition, the best models for the mixed set of images were obtained. To compare models we used the rate of correctly predicted pairwise outcomes; the results are presented in Table 3.

We ran the ranking approach over the possible combinations of five features and looked at the models that give the best results for each set separately and at those that perform well on both sets. When testing a model on both sets, we use the sum of the log-likelihoods computed for the two sets separately and take the average of the feature weights over the two sets. The performance of every model was estimated by the number of pairwise comparisons that its ratings predict correctly (Table 3).
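The evaluation criterion above can be sketched as follows. This is an illustration only: the function names are hypothetical, and `averaged_weights` simply implements the element-wise mean of the two weight vectors mentioned in the text.

```python
import numpy as np


def correct_ratio(weights, features, comparisons):
    """Fraction of comparisons whose outcome agrees with the order of the ratings.

    comparisons: sequence of (a, b, x), x = 1 if image a was preferred to b.
    """
    ratings = np.asarray(features, dtype=float) @ np.asarray(weights, dtype=float)
    correct = 0
    for a, b, x in comparisons:
        predicted_a_wins = ratings[a] > ratings[b]
        correct += int(predicted_a_wins == (x == 1))
    return correct / len(comparisons)


def averaged_weights(weights_ms, weights_ts):
    """Element-wise mean of the weights fitted on the two image sets."""
    return (np.asarray(weights_ms, dtype=float)
            + np.asarray(weights_ts, dtype=float)) / 2.0
```

Note that when a feature has weights of opposite sign on the two sets, averaging pushes its weight towards zero, so the averaged model partly ignores that feature.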

Table 3 Best predictor values for models with restricted sets of factors. Table contains best three models according to average rate of correct pairwise comparisons

Ratio of correct comparisons

ent11, sharp2, block42, block62, intens2
entB0, entF0, ent11, ent12, block82
Ent10, entF1, block20, noise1, block22
Contr20, ent11, entF2, sharp2, contr12
Blur20, ent10, ent11, block21, block22
Ent10, entF1, block20, noise1, block22

According to the table, some of the best models, which perform well on each set separately, give worse results on the mixed set of images. This can be clearly seen in the 3D plot (Figure 6) and the 2D plots (Figure 7 a, b, c) of the models. Each axis corresponds to model quality on one of the sets: TS, MS or the mixed set containing both. Most models have better quality on each of MS and TS but lower quality on the mixed set. This means that the models are quite good even with five features; however, these features are sensitive to image content, so using averaged weights degrades the model. Moreover, in many cases the feature weights for the two sets have opposite signs.

Another interesting finding is that putting all 57 features into one model seriously degrades the result: it gives around 40-50% of correct pairs, which is no better than random choice.

If we look at the features contained in the best five models, most of them repeat the results obtained with OLS regression. Among the most important is the entropy of the whole image and of its background and foreground at all levels of the pyramid. In addition, blurriness, blockness, noise, average intensity and contrast occur in the top models, which does not contradict the results obtained with OLS regression in Section 4.1.

Compared with the previous approach with known quality measures, the Elo rating approach gives 24-27% of inverted pairs on the separate sets, which is better than linear regression. This is to be expected, since the initial comparisons matrix is used as ground truth. On the mixed set the models are not able to provide a good result because of the difference in weights. We take a closer look at this question in the next section.

Figure 6 Models in 3-dimensional space

Figure 7 a b c

3.4 Comparing feature weights

After obtaining the sets of most important features, our intention was to look for features invariant to the scene and to try to derive a single formula of quality from the separate models for both image sets. In addition, we tested the best models for each image set separately. Using the initial comparisons matrix as ground truth, we trained a linear classifier with a binary outcome to check the results obtained at the previous steps. The first part of this experiment consisted of training a model on one set and testing it on the other. If the feature weights derived from the first image set gave a good prediction for the second set as well, we would conclude that the selected features provide a good representation of human image quality perception.

The second part was to check model performance on each set, and to draw training and testing samples from the mixed set, to make sure that a restricted number of features can provide acceptable results. For both parts, the main requirement was the use of linear classifiers, in line with the earlier assumption that image quality depends linearly on the image features.

We used a logistic regression classifier, which assumes a linear dependence between the log-odds of the outcome and the features. For every pair we use the differences of the features between the left and the right image and a binary target variable, which equals 1 if the left image wins. The Scikit-learn implementation of the logistic regression classifier was used (the implementation is described on the library's web page). We studied model quality metrics such as the accuracy score and the area under the ROC curve to evaluate model performance and to see whether the selected features can provide a good result. At the final step we took the best ten models of five features and performed a number of binary classification experiments using the logistic regression classifier with an intercept.
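The construction of the pairwise dataset and the classifier described above might look like the sketch below. It assumes scikit-learn's `LogisticRegression` with default regularization, which the text does not specify; the helper names `pair_dataset` and `fit_and_score` are hypothetical, and the scores are computed on the training pairs purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score


def pair_dataset(features, comparisons):
    """Build X (left minus right features) and y (1 if the left image wins)."""
    features = np.asarray(features, dtype=float)
    X = np.array([features[a] - features[b] for a, b, _ in comparisons])
    y = np.array([x for _, _, x in comparisons])
    return X, y


def fit_and_score(features, comparisons):
    """Fit a logistic regression with intercept; return it with accuracy and AUC."""
    X, y = pair_dataset(features, comparisons)
    clf = LogisticRegression(fit_intercept=True)
    clf.fit(X, y)
    scores = clf.predict_proba(X)[:, 1]   # estimated probability that the left image wins
    return clf, accuracy_score(y, clf.predict(X)), roc_auc_score(y, scores)
```

The fitted `clf.coef_` plays the role of the feature weights, so it can be compared directly between image sets.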

The first part of the experiment consisted of training the classifier on one homogeneous set of images and testing it on the other. These experiments show very low quality regardless of the number of features in the model. The accuracy score is below 45%, and precision and recall are close to 50%, which is no better than random choice. This result was obtained in all experiments with this design. The example of feature weights for the same model learned on each set of images, presented in Table 4, demonstrates that the coefficients differ between the sets.
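The effect of differing weights on cross-set testing can be illustrated on synthetic data. The sketch below is not the original experiment: `synthetic_pairs` is a hypothetical generator of pairwise feature differences, and the weight vectors in the usage example are arbitrary values chosen so that one feature changes sign between the two "sets".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score


def cross_set_scores(X_train, y_train, X_test, y_test):
    """Train on pairs from one set and evaluate on pairs from another."""
    clf = LogisticRegression(fit_intercept=True).fit(X_train, y_train)
    pred = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
    }


def synthetic_pairs(true_weights, n_pairs, rng):
    """Random feature differences with outcomes drawn from a logistic model."""
    X = rng.normal(size=(n_pairs, len(true_weights)))
    p = 1.0 / (1.0 + np.exp(-X @ np.asarray(true_weights, dtype=float)))
    y = (rng.random(n_pairs) < p).astype(int)
    return X, y
```

A possible usage: generate one set with weights `[2.0, 1.0]` and another with `[-2.0, 1.0]`, then call `cross_set_scores` with the first as training data and the second as test data. Because the dominant feature has opposite sign, the cross-set accuracy falls to chance level or below, while the same model scores well on its own set.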

Table 4 Features weights learned on each set

Mixed set

F1 score

When training and testing on the same set of images, better results were achieved even with a five-feature set. For example, the fifth model from Table 3 gives better results on both sets. With random shuffle cross-validation and a 20% test size, it reaches an average accuracy of 72% on the trees dataset and 71% on the medical dataset. On the mixed dataset, where examples from both sets were included in the training and testing sets, the average accuracy is about 59%.
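The random shuffle cross-validation with a 20% test share can be sketched with scikit-learn's `ShuffleSplit`. The number of splits and the random seed below are assumptions, since the text does not state them, and the function name is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score


def shuffle_split_accuracy(X, y, n_splits=10, test_size=0.2, seed=0):
    """Mean accuracy over random train/test splits with the given test share."""
    cv = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    clf = LogisticRegression(fit_intercept=True)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    return scores.mean()
```

Here X holds the feature differences of the pairs and y the binary outcomes. Splitting is done over pairs, so the same image may appear in both training and test pairs.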

Another set of experiments considered models including all 57 features. In this case the average accuracy is 76% for the mixed dataset, 80% for the medical dataset and 77% for the trees dataset. On the separate sets this is only 5-9 percentage points above the five-feature model, which suggests that the best five-feature models contain most of the useful signal needed for classification.

If we train on one set and test on the other using all 57 features, the model still gives only about 50% accuracy.

These results show that the selected models with a restricted feature set are good enough for each of the two sets of images. However, there is no universal formula of quality for both sets at once, because the feature weights differ.

Figure 8 ROC curves for five features classifiers


Conclusion and further research

Using two datasets of very different nature, we identified the most important image quality factors explaining human perception of image quality. We used two major approaches. The first approach uses a vector of known quality measures obtained from the initial comparisons using an arbitrary formula, the ratio of wins to comparisons. The second approach treats quality as an unknown and tries to find its values using the raw comparisons matrix as the source of information. Comparing these two approaches by their fraction of incorrectly predicted pairwise comparisons (inverted image pairs), we obtained 29-30% for the first and 24-27% for the second approach.

We also observed that some factors were conceptually similar, which enabled us to select a limited set of really important quality factors. In the case of medical images, this is a very useful finding, which enables us to interpret quality perception and not only to rank images by a number of features but also to try to build a framework that improves particular image features. Such a tool could be one potential practical extension of this study.

Still, we would like to extend and generalize the results by validating them on more datasets. Another potential direction lies in the field of ranking and classifying images by quality. After enlarging the dataset of manually ranked images, we could compare the ranking provided by a neural network, which can use a large number of features, with that of a classifier that uses a restricted set of the most important features. However, such a comparison would be fair only on a dataset of neutral monochrome images, which makes it useful only for a specific field such as medical imaging.

All things considered, our results demonstrate that image quality perception can be modeled with a small set of non-reference factors that are easy to interpret. This can definitely lead to new useful tools for image quality control.

Works cited

[1] Crete F., Dolmiere T., Ladret P., The Blur Effect: Perception and Estimation with a New No-Reference Perceptual Blur Metric. Grenoble: SPIE Electronic Imaging Symposium Conf Human Vision and Electronic Imaging, États-Unis d'Amérique, 2007.

[2] Kerouh F., Serir A., A no-reference blur image quality measure based on wavelet transform. Digital Information Processing and Communications, 2012.

[3] K. De, A new no-reference image quality measure to determine the quality of a given image using object separability. Taipei: Machine Vision and Image Processing (MVIP), 2012 International Conference on, 2012.

[4] Jeffrey P. Woodard, Monica P. Carley-Spencer, No-Reference image quality metrics for structural MRI. Neuroinformatics, 2006, vol. 4.

[5] Kumar J., Chen F., Doermann D., "Sharpness estimation for Document and Scene Images," in Pattern Recognition (ICPR), 2012 21st International Conference on, Tsukuba, 2012, pp. 3292-3295.

[6] C Chen, JA Bloom, A blind reference-free blockiness measure. Shanghai: in Proceedings of the Pacific Rim Conference on Advances in Multimedia Information Processing: part I, 2010.

[7] Xinhao Liu, Masayuki Tanaka and Masatoshi Okutomi, Noise Level Estimation Using Weak Textured Patches of a Single Noisy Image. IEEE International Conference on Image Processing (ICIP), 2012.

[8] Tao Yuan, Xinqi Zheng, Xuan Hu, Wei Zhou, Wei Wang, A method for the evaluation of image quality according to the recognition effectiveness of objects in the optical remote sensing image using machine learning algorithm. PLoS ONE, 2014.

[9] Arpad E. Elo, "8.4 Logistic Probability as a Rating Basis". The Rating of Chessplayers, Past & Present. NY, United States: Ishi Press International, 2008.

Source code

A. Elo rating approach

import math
import random

import numpy
import pandas as pd
import scipy.optimize


class LikelihoodCalculator:
    """Negative log-likelihood of the pairwise comparisons and its gradient."""

    def __init__(self, features, comparisons):
        self.features = features        # one list of feature values per image
        self.comparisons = comparisons  # triples (i1, i2, v); v == 1 if image i1 wins

    def getLogLikelihood(self, ratings):
        logLikelihoodSum = 0.0
        for (i1, i2, v) in self.comparisons:
            if abs(ratings[i2] - ratings[i1]) > 200.0:
                # Large rating gap: use the asymptotic value to avoid overflow in exp()
                logLikelihoodSum += -abs(ratings[i2] - ratings[i1]) if (v == 1) == (ratings[i2] > ratings[i1]) else 0.0
            else:
                # p is the probability that image i1 wins
                p = 1.0 / (1.0 + math.exp(ratings[i2] - ratings[i1]))
                logLikelihoodSum += math.log(abs(1.0 - p - v))
        return logLikelihoodSum

    def getRatings(self, weights):
        # Rating of an image is the weighted sum of its features
        return [sum(weight * feature for (weight, feature) in zip(weights, features1))
                for features1 in self.features]

    def updateDerivatives(self, weights, featuresA, featuresB, v, derivatives):
        ratingA = sum(weight * featureA for (weight, featureA) in zip(weights, featuresA))
        ratingB = sum(weight * featureB for (weight, featureB) in zip(weights, featuresB))
        # The exponent is clamped so that math.exp() cannot raise OverflowError
        exp1 = math.exp(min(ratingB - ratingA, 700.0))
        value = 1.0 / (1.0 + exp1)
        if exp1 > 1e50:
            derivativeA = 1.0 / exp1
        else:
            derivativeA = exp1 / (1.0 + exp1) ** 2
        derivativeB = -derivativeA
        for j in range(len(weights)):
            derivatives[j] += (derivativeA * featuresA[j] + derivativeB * featuresB[j]) / (value + v - 1.0)

    def __call__(self, weights):
        weights = list(weights)
        ratings = self.getRatings(weights)
        value = self.getLogLikelihood(ratings)
        derivatives = [0.0 for j in range(len(weights))]
        for (a, b, v) in self.comparisons:
            self.updateDerivatives(weights, self.features[a], self.features[b], v, derivatives)
        print("Value: " + str(value))
        # The optimizer minimizes, so return the negated log-likelihood and gradient
        return (-value, numpy.array([-d for d in derivatives]))


def findOptimalWeights(features, comparisons):
    weightsCount = len(features[0])
    weights0 = [random.random() for j in range(weightsCount)]
    print('START')
    (weights, f, d) = scipy.optimize.fmin_l_bfgs_b(LikelihoodCalculator(features, comparisons), weights0)
    print(f)
    print(d)
    return weights


def checkDerivative(obj, point, u):
    """Compare the analytic derivative number u with finite differences."""
    print("Starting, point: " + str(len(point)) + ", u: " + str(u))
    (initialValue, gradient) = obj(point)
    gradient = list(gradient)
    print("Starting, point: " + str(len(point)) + ", gradient: " + str(len(gradient)) + ", u: " + str(u))
    print("Calculated derivative for u = " + str(u) + ": " + str(gradient[u]))
    for power in range(-7, -4):
        delta = 10.0 ** power
        pointWithDelta = point[:]
        pointWithDelta[u] += delta
        (value, gradient1) = obj(pointWithDelta)
        print("delta: " + str(delta) + ", value: " + str(value) + ", derivative: " + str((value - initialValue) / delta))


def main():
    features_df = pd.read_csv("/Users/nephidei/Documents/imgproc/final/reit/trees_features.csv", sep=';')
    for nabor in [[3, 15, 2, 50, 6, 11, 21]]:
        # nabor is a feature set; columns are selected by position
        features_selected = features_df.iloc[:, nabor]
        features = [list(row) for row in features_selected.values]
        comparisons_df = pd.read_csv("/Users/nephidei/Documents/imgproc/final/reit/trees_comp_sure.csv", sep=';')
        comparisons = [list(row) for row in comparisons_df.values]
        for comp in comparisons:
            # image numbers in the file start from 1
            comp[0] -= 1
            comp[1] -= 1
        print(comparisons)
        weights = findOptimalWeights(features, comparisons)
        print("Weights: " + str(weights))
        # Count the comparisons that the fitted ratings predict correctly
        okCount = 0
        badCount = 0
        for (a, b, v) in comparisons:
            ratingA = sum(weight * featureA for (weight, featureA) in zip(weights, features[a]))
            ratingB = sum(weight * featureB for (weight, featureB) in zip(weights, features[b]))
            if (ratingA > ratingB) == (v == 1):
                okCount += 1
            else:
                badCount += 1
        print("OK: " + str(okCount) + ", bad: " + str(badCount))


if __name__ == "__main__":
    main()


B. Code for non-reference quality measures

Average gradient

function AG = avgGrad(image)
% original image F:
imageF = im2double(image);
[m, n] = size(imageF);
Gx = zeros(m-1,n-1);
for i=1:m-1
    for j=1:n-1
        a1 = imageF(i,j);
        a2 = imageF(i+1,j);
        a3 = imageF(i, j+1);
        sum1 = ((a1-a2)^2 + (a1-a3)^2);
        Gx(i,j) = sqrt((sum1/2));
    end
end
% mean of the local gradient magnitudes
C = 1/((m-1)*(n-1));
S = sum(Gx(:));
AG = C*S;
end

% blockness
% A Blind Reference-Free Blockiness Measure
function blockness = blockness(image, bl)
% convert to double so that differences and squares do not saturate
imageF = im2double(rgb2gray(image));
[m, n] = size(imageF);
% window width
w = 8;
% block size parameter: bl
% difference
diff_hor = abs(imageF(1:m-1, :) - imageF(2:m, :));
diff_vert = abs(imageF(:, 1:n-1) - imageF(:, 2:n));
% normalization
d_norm_hor = zeros(m,n);
for ii=1+w:m-w-1
    for j=1:n
        expr1 = sum(diff_hor(ii-w:ii+w,j).^2) - diff_hor(ii,j)^2;
        % koren: root of the mean squared difference of the neighbours
        koren = double(expr1 / (2 * w + 0.0))^0.5;
        d_norm_hor(ii,j) = diff_hor(ii,j)/koren;
    end
end
% horizontal profile
prof_hor = 1/n*(sum(d_norm_hor,2));
PH_values = zeros(m-1,1);
sum_FPH = 0.0;
for ii = 1:m-1
    X = ii*(m-1)/bl-1.0;
    sum_PH = 0.0;
    for xi = 1:m-1
        % 1i is the imaginary unit
        sum_PH = sum_PH + prof_hor(xi) * exp(-1i*2*pi*xi*X/(m-1));
    end
    FPH = abs(sum_PH);
    PH_values(ii) = FPH;
    sum_FPH = sum_FPH + FPH^2;
end
bm_h = 1/sum(prof_hor(1:m-1))*sqrt((1/(bl-1))*sum_FPH);
% normalization vert
d_norm_vert = zeros(m,n);
for ii=1:m
    for j=1+w:n-w-1
        expr1 = sum(diff_vert(ii,j-w:j+w).^2) - diff_vert(ii,j)^2;
        koren = double(expr1 / (2 * w + 0.0))^0.5;
        d_norm_vert(ii,j) = diff_vert(ii,j)/koren;
    end
end
% vertical profile (average over the m rows)
prof_vert = 1/m*(sum(d_norm_vert,1));
PH_values_vert = zeros(n-1,1);
sum_FPH_vert = 0.0;
for j = 1:n-1
    X_vert = j*(n-1)/bl-1.0;
    sum_PH_vert = 0.0;
    for xj = 1:n-1
        sum_PH_vert = sum_PH_vert + prof_vert(xj) * exp(-1i*2*pi*xj*X_vert/(n-1));
    end
    FPH_vert = abs(sum_PH_vert);
    PH_values_vert(j) = FPH_vert;
    sum_FPH_vert = sum_FPH_vert + FPH_vert^2;
end
