DEEP CONVOLUTIONAL NEURAL NETWORKS FOR SENTIMENT ANALYSIS OF CULTURAL HERITAGE
M. Paolanti2∗, X. Xxxxxxxxx0, X. Xxxxxxx0, X. Xxxxxxxxx0, E. S. Malinverni1, X. Xxxxxxxx0, P. Zingaretti2
1Università Politecnica delle Marche, Dipartimento di Ingegneria Civile, Edile e dell'Architettura, 00000 Xxxxxx, Xxxxx (x.xxxxxxxxx, x.x.xxxxxxxxxx)@xxxxx.xxxxxx.xx
2Università Politecnica delle Marche, Dipartimento di Ingegneria dell'Informazione, 60100 Ancona, Italy (x.xxxxxxx, x.xxxxxxxxx)@xx.xxxxxx.xx, (x.xxxxxxxx, x.xxxxxxxx, x.xxxxxxxxxx)@xxxxx.xxxxxx.xx
Commission II, XX XX/8
KEY WORDS: Cultural Heritage, Sentiment Analysis, Deep Convolutional Neural Network
ABSTRACT:
The promotion of Cultural Heritage (CH) goods has become a major challenge over the last years. CH goods promote economic development, notably through cultural and creative industries and tourism. Thus, an effective planning of archaeological, cultural, artistic and architectural sites within the territory makes CH goods easily accessible. A way of adding value to these services is making them capable of providing, using new technologies, a more immersive and stimulating fruition of information. In this light, an effective contribution can be provided by sentiment analysis. The sentiment related to a monument can be used for its evaluation, considering that if it is positive, it influences its public image by increasing its value. This work introduces an approach to estimate the sentiment of CH-related Social Media pictures. The sentiment of a picture is identified by an especially trained Deep Convolutional Neural Network (DCNN); afterwards, we compared the performance of three DCNNs: VGG16, ResNet and InceptionResNet. It is interesting to observe how these three different architectures are able to correctly evaluate the sentiment of images referring to ancient monuments, historical buildings, archaeological sites, museum objects, and more. Our approach has been applied to a newly collected dataset of pictures from Instagram, which show CH goods included in the UNESCO list of World Heritage properties.
1. INTRODUCTION
A monument is a structure that was built to commemorate a person or an event and that takes on historical, cultural, religious and identity importance for a people (Xxxxxxx et al., 2001). Furthermore, monuments are among the main tourist destinations of a place, also affecting the economy of the country. In Italy, the Cultural Heritage (CH) is huge and there is an urgent need to preserve, protect, secure, manage without waste and, above all, enhance this heritage for Italians, foreigners and future generations. For this purpose it is essential to have an evaluation of the cultural heritage and its state, which cannot be built without good documentation (Szmelter, 2013). We can include the sentiment related to a monument in its evaluation, considering that this affects its public image and increases its value. It is then necessary that the aforementioned information has a specific structure, so that it can be correctly maintained, successfully searched, and used with ease. Digital documentation is the process that creates the documentation and is divided into two steps: first, data acquisition and storage; then, data classification in order to create the structure (Remondino and Xxxxx, 2010). Despite the pressure from international organizations, a standardization of the documentation and the process has not yet been reached. In any case, among the most common sources of documentation there are photographs, drawings and images of all kinds, coming from every sensor or technology, in enormous quantities. The many images made by non-professionals but easily available on social media can be a source of complementary documentation, useful from various points of view, especially if structured through Sentiment Analysis (SA). The protection and promotion of Cultural Heritage (CH) goods
are major challenges of these years (Xxxxxxxx, 2017). In this respect, monuments and historical buildings constitute the primary means by which tangible and intangible testimonies of nature and human cultures are safeguarded (Ferretti and Comino, 2015). They represent cultural transmission, learning, intercultural dialogue, discussion and training, and they also play an important role in education (formal, informal, and lifelong learning), social cohesion and sustainable development. CH goods promote economic development, notably through cultural and creative industries and tourism. These aspects draw attention to the importance of the promotion of CH goods and collections. Thus, an effective planning of archaeological, cultural, artistic and architectural sites within the territory makes CH goods easily accessible (Colace et al., 2015).
A way of adding value to these services is making them capable of providing, using new technologies, a more immersive and stimulating fruition of information (Lops et al., 2009). In particular, tourists, visitors and citizens equipped with appropriate devices easily create and share their ideas, opinions and contents with millions of other people around the world through social networks. In this light, an effective contribution can be provided by sentiment analysis (Xxxx et al., 2008). The sentiment related to a monument can be used for its evaluation, considering that if it is positive, it influences the monument's public image by increasing its value. The main purpose of SA is the identification of agreement or disagreement in pictures that deal with positive or negative feelings. Besides, Instagram provides a rich repository of images and captions that are associated with users' sentiments (Xxxx and Xx, 2015, Xxxx et al., 2015). These images do not only reflect people's social lives, but also express their opinions about places, monuments and territory. Social media pictures therefore represent a rich source of knowledge for understanding users' opinions.
This work introduces an approach to estimate the sentiment of CH-related social media pictures. The sentiment of a picture is identified by an especially trained Deep Convolutional Neural Network (DCNN).
The DCNN is based on a VGG16 network architecture (Xxxxxxxx and Xxxxxxxxx, 2014) and it is trained by fine-tuning a model pretrained on the ImageNet dataset (Xxxxxxxxxx et al., 2012), (Xxxxxxxxx et al., 2018). Fine-tuning, a very simple transfer learning method, initializes the weights of the network (except those of the last layers) from a network pretrained on another task, instead of using random initialization. Furthermore, we compare the performance of the VGG16 network with other DCNNs used for image classification: ResNet (He et al., 2016) and InceptionResNet (Szegedy et al., 2017). The evaluation has been performed in Keras, a TensorFlow-based Python module which allows neural network models to be implemented and trained. Since the models are pretrained on 1000 classes, it was necessary to modify the last layers of the networks before proceeding with the fine-tuning.
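A minimal sketch of this head replacement, assuming TensorFlow 2.x Keras; the 256-unit hidden layer is illustrative and not taken from the paper:

```python
# Fine-tuning sketch, assuming TensorFlow 2.x / Keras.
import tensorflow as tf
from tensorflow.keras import layers, models

# Load VGG16 pretrained on ImageNet, dropping the original 1000-class head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))

# Replace the last layers with a new classifier for the 3 sentiment classes
# (positive / neutral / negative); the 256-unit layer is illustrative.
x = layers.Flatten()(base.output)
x = layers.Dense(256, activation="relu")(x)
out = layers.Dense(3, activation="softmax")(x)
model = models.Model(base.input, out)

# Fine-tuning: all weights start from the pretrained values rather than
# random initialization; a small learning rate avoids destroying them.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
```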
The approach has been applied to a newly collected dataset, the “Cultural HeRItage Sentiment” (XXXXX) Dataset, of pictures from Instagram which show CH goods included in the UNESCO list of World Heritage properties. To the best of our knowledge, this is the first study on sentiment analysis of CH-related pictures on Instagram. We used the captions of the Instagram posts to pre-select images that have detectable sentiment content about well-known CH goods (such as the Tour Eiffel, the Colosseum and so on). Typically, the image captions indicate the user's sentiment for the uploaded images. To obtain the ground truth of the collected pictures, the true sentiment has been manually estimated by human annotators, thus providing a more precise and less noisy dataset compared to labels generated automatically from image captions or hashtags. Since sentiment estimation is a subjective task, where different people may assign different sentiments to images, we asked two persons to judge the sentiment of the images and measured their agreement. The inter-annotator agreement is a common approach to determine the reliability of a dataset and the difficulty of the classification task (Xxxxxxxx et al., 2008), (Xxxxxxxx et al., 2017).
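The paper does not name its agreement measure; as an illustration, Cohen's kappa is a common choice for two annotators. A minimal sketch with hypothetical labels:

```python
# Illustrative inter-annotator agreement via Cohen's kappa (assumption:
# the measure used by the authors is not specified in the text).
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from the two annotators over the same ten images:
# 0 = negative, 1 = neutral, 2 = positive.
annotator_a = [2, 2, 1, 0, 1, 2, 0, 0, 1, 2]
annotator_b = [2, 1, 1, 0, 1, 2, 0, 1, 1, 2]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance
```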
The paper is organised as follows: Section 2 is an overview of the research status of sentiment analysis approaches for the cultural heritage field; Section 3 introduces our approach more specifically, describing the purposely created dataset, the CNN visual classifiers and the performance evaluation metrics used for analysing the ability of the three different CNNs employed; Section 4 shows the results obtained; and Section 5 discusses the conclusions and future works.
2. RELATED WORKS
Sentiment analysis, i.e. the mathematical/statistical analysis of information, aims to quantify the polarity (positive, negative or neutral) of a sentiment expressed, in our case, in an image posted on a social network. The added value of sentiment analysis with respect to the usual customer satisfaction techniques lies in the fact that it listens to emotions that are spontaneously provided and that, for this reason, reflect the real expectations and moods of users. SA makes it possible to transform the immense amount of data generated by social media into knowledge. The analysis of the perception of a CH good by citizens becomes, therefore, an unparalleled tool of territorial marketing and a political indicator that can guide and direct choices and actions. The data sources analyzed in this study are innovative and alternative to those on which the classic customer satisfaction methods are based.
According to (Xxxxxx et al., 2017), visual sentiment analysis is a growing area of research, since images, by capturing moments related to deep feelings, have become an important component of our digital social life. In that work, the authors used a collected Twitter dataset, called DeepSent, to train and evaluate a CaffeNet CNN architecture to recognize the polarity (positive or negative) of a visual sentiment.
In (Xxxxx et al., 2017) the authors investigate whether, and which, types of objects are most responsible for evoking emotions and feelings within an image, checking whether the sentiment corresponding to the “regions of salience” agrees with the general sentiment expressed by the entire image. The dataset is partitioned by dividing the images according to scene-level meta-attributes such as indoor-outdoor, natural-manmade, face-noface, and more, before providing the sentiment classification in only two categories, positive and negative. Accuracy is over 82% using a CNN pre-trained for image recognition. In conclusion, the attributes that tend to dominate the perception of sentiment by the subject are first of all faces, meaning that in many cases we can consider facial expressions without paying attention to other objects within the image.
In (You et al., 2015), the authors explored different applications of CNNs for visual sentiment prediction and then presented a CNN for the same purpose, demonstrating that their proposal outperforms state-of-the-art architectures. They propose a training method called progressive learning, which addresses the problem of data noise by filtering noisy samples from the training set. The basic idea is that in SA each class contains images that are so different that it is extremely difficult to discover features that characterize the classes, and several people could have opposing sentiments about the same image. This calls for a supervised learning engine able to tolerate a significant level of noise in the training set, such as the progressive one.
In (Xxxxxxx et al., 2012), the idea is to use the “game with a purpose” paradigm as a source of crowdsourcing annotation, in which users, as a side effect of the game, perform the annotation work. This strategy goes alongside the more common one that exploits the famous Amazon Mechanical Turk crowdsourcing platform, where workers can be recruited as annotators. The work also refers to the problem of background knowledge: the high level of abstraction required by SA may require basic knowledge, context and history, which go beyond the strict visual content.
In (Xxx et al., 2016), a fundamental point about CNNs is asserted: they can perform the direct mapping between vision and sentiment well, deducing the sentiment directly from the visual content, without the need to construct mid-level representations, which are provided automatically by the networks themselves. Using the same dataset, the authors compare the performance of three architectures: GoogleNet, VGGNet and ResNet, demonstrating that the latter works better than the first two, even if the specific dataset plays a fundamental role.
In (Xxxxxx and Xxxxx, 2015) a sentiment rating based on seven “votes”, which include the neutral class and different degrees of intensity of sentiment of the same polarity, was proposed, on the idea that the strength of a sentiment is as important as its polarity and preferable to fine-grained categorization. The authors trained an AlexNet with 5 convolutional layers and 3 fully connected layers through a strategy that led to an accuracy of 53.5%. From the point of view of visual sentiment for cultural heritage, we cite some works that have been a useful guide for ours. In (Xxxxx et al., 2017), the authors achieved an accuracy of 92.7% by training a network to recognize 100 particular Indian monuments through a dataset of 5,000 images.
Figure 1. DCNNs for sentiment classification.
The great variety of background images, points of view, monument architectures, and more, is a source of noise that degrades accuracy. To solve this problem, the authors used clean images, manually cropped to minimize noise. To capture the different characteristics of the data, they then trained three CNNs with AlexNet architecture, concatenating the output representations of all three into a single final vector. The best result was achieved by excluding the FC7 layer from the architecture.
In (Xxxxxx et al., 2017) the authors illustrate the importance of digital documentation for architectural cultural heritage. They collected a dataset of 10,000 images to be grouped according to the constituent elements of the monuments, and then created other datasets with the same images scaled to different sizes, to estimate the compromise between performance and time. Comparing the performance of full training and fine-tuning on different networks, the best accuracy of 93% was achieved by fine-tuning the Inception-ResNet-v2 network.
In (Xxxxxx et al., 2017) the authors visualize the local patterns learned by the network associated with sentiment, in order to see how sentiment is perceived by the model. They purposely introduced ambiguous annotations in order to develop a model that is reliable even with this kind of noise on the labels. They also analyzed the impact of weight initialization by varying the source domain within transfer learning.
3. MATERIALS AND METHODS
The huge amount of images shared on Instagram, together with its recent widespread expansion in Italy, has led us to use images from this social network for CH. Users upload an image, write a caption to their liking that often describes an experience or opinion, and finally tag it with some hashtags and publish it on their profile. There are also generic profiles that present themselves as disseminators of a particular theme. Hashtags are now very popular and there is no shortage of hashtags related to cultural heritage, as well as hashtags related to emotions and feelings. They can be generic as well as specific to a single monument. Finally, there are also the locations of the monuments, which translate into the unique IDs that Instagram assigns to all the images tagged with a certain geotag. In this section, we introduce the framework as well as the novel social media XXXXX Dataset collected for this work and used for evaluation. The framework is depicted in Figure 1. Further details are given in the following subsections.
3.1 The CNNs for the visual sentiment classifier
A convolutional neural network (CNN or ConvNet) is one of the most common algorithms for deep learning, a type of machine learning in which a computer model learns to perform classification tasks directly from images, videos, text or sounds. CNNs are particularly useful for finding patterns in images to recognize objects, faces and scenes. They learn directly from image data, using patterns to classify images and eliminating the need for manual feature extraction. Moreover, CNNs can be re-trained for new recognition tasks, allowing existing networks to be exploited. A CNN can have tens or hundreds of layers, each of which learns to detect different features of an image. Filters are applied to each image at different resolutions, and the output of each convolved image is used as input for the next layer. Filters can initially capture very simple features, such as brightness and edges, and take on increasingly complex shapes that uniquely define the object. Like other neural networks, a CNN is composed of an input layer, an output layer and various hidden layers in between. These layers perform operations that alter the data in order to learn its specific features. Three of the most common layers are convolution, activation (ReLU) and pooling:
• Convolution: applies a series of convolutional filters to the input images, each of which activates certain features of the images.

• Rectified Linear Unit (ReLU): enables faster and more effective training by mapping negative values to zero and maintaining positive values. Only activated features are passed to the next layer.

• Pooling: simplifies the output by performing non-linear subsampling, reducing the number of parameters the network needs to learn.
These operations are repeated on tens or hundreds of layers, each of which learns to identify different features. After learning the features in various layers, the architecture of a CNN moves to classification. The penultimate layer is a fully connected layer that generates a vector of K dimensions where K is the number of classes that the network will be able to predict. This vector contains the probabilities for each class of any classified image. The final layer of the CNN architecture uses a classification layer like softmax to provide the classification output.
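As a minimal sketch of this layer pattern (not the paper's architecture), in Keras, which Section 3.1.1 states was the library used:

```python
# Minimal CNN sketch of the pattern described above: convolution + ReLU,
# pooling, then a K-way fully connected layer followed by softmax.
import tensorflow as tf
from tensorflow.keras import layers, models

K = 3  # number of classes the network will predict

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                   # pooling (subsampling)
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(K),        # K-dimensional vector, one entry per class
    layers.Softmax(),       # classification output: class probabilities
])
model.summary()
```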
3.1.1 VGG-16 network The first CNN we considered is a VGG-16 net (Xxxxxxxx and Xxxxxxxxx, 2014). The VGG-16 network was chosen because it is easy to implement and for its success in the ILSVRC-2014 competition, where it ranked first in the localization track and second in the classification track. VGG-16 was originally trained on the ImageNet database of labeled images in 1000 classes (Xxxxxxxxxx et al., 2012) and is a very deep network with 16 weight layers (13 convolutional and 3 fully connected). The network consists of 5 convolutional blocks, whose output filters have [64, 128, 256, 512, 512] dimensions, and a fully-connected classifier. A characteristic element of the VGG architecture is also the introduction of 1×1 convolution filters, which in general are used to reduce the channel dimension while leaving the spatial dimension of the output unaltered; the meaning of this operation lies in the fact that, since each convolution layer is followed by a ReLU layer, the insertion of these filters allows further non-linear effects to be modeled. We developed the VGG16 model in Keras, a high-level neural network library written in Python. Our implementation of the VGG16 net refers to the work in (Xxxxxxxx and Xxxxxxxxx, 2014). Each image is resized to 224 × 224 pixels. Since there is no batch normalization layer in VGG16, input images have been normalized. We use the stochastic gradient descent (SGD) optimizer with a batch size of 5. After preliminary experiments the learning rate is very small and fixed to 10^-5. The network is trained with a binary cross-entropy (BCE) loss for 100 epochs.
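A minimal sketch of this training configuration, assuming TensorFlow/Keras; the dataset is a random stand-in, and `weights=None` builds a fresh 3-class VGG16 for brevity instead of the fine-tuned model sketched in the Introduction:

```python
# Training-setup sketch: resize to 224x224, normalize, SGD with lr = 1e-5,
# batch size 5, 100 epochs, as reported in the text.
import tensorflow as tf

def preprocess(image, label):
    image = tf.image.resize(image, (224, 224))        # resize to 224 x 224
    return tf.cast(image, tf.float32) / 255.0, label  # normalize (no batch-norm in VGG16)

# Dummy stand-in for the training images (10 RGB images, 3 classes).
images = tf.random.uniform([10, 256, 256, 3], maxval=255.0)
labels = tf.one_hot(tf.random.uniform([10], maxval=3, dtype=tf.int32), 3)
train_ds = (tf.data.Dataset.from_tensor_slices((images, labels))
            .map(preprocess)
            .batch(5))                                # batch size 5

model = tf.keras.applications.VGG16(weights=None, classes=3)  # placeholder network
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5),
              loss="binary_crossentropy",  # BCE as reported; categorical
                                           # cross-entropy is the usual 3-class choice
              metrics=["accuracy"])
model.fit(train_ds, epochs=100)
```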
3.1.2 ResNet network The second network is the residual network (ResNet) (He et al., 2016), an innovation in the field of convolutional neural networks and winner of the ImageNet competition in 2015, with a top-5 error of 3.57%. The key question that the developers asked themselves when designing this network was: why does a deep network perform worse as layers are added? The hypothesis of the ResNet authors was that direct mappings are difficult to learn. So they proposed a change: instead of trying to estimate a function G(·) that given an x returns G(x), it is better to learn the difference between the two, also called the residual, hence the name of the network. Consequently, to calculate G(x) starting from x its residual must be added: F(x) = G(x) − x. F(x) is the residual, and therefore the network, instead of directly learning G(x), will try to learn F(x) + x. Thus the ResNet (residual network) blocks were introduced. Each ResNet block consists of a series of layers and an identity mapping that adds the input of the block to its output. This “addition” operation is done element by element, and if the input and output have different sizes, zero-padding or projection techniques (through 1×1 convolutions) can be used to match the dimensions. Therefore, ResNet provides layers with a reference point x from which to learn, rather than starting from scratch with an identity transformation. Before this idea, deep neural networks suffered from the vanishing gradient problem: the gradient of the error function shrinks exponentially as it is backpropagated through the previous layers. In essence, the long path through the previous layers made the error signals so small that they did not allow the network to learn. Thanks to the innovation introduced by ResNet it is now possible to build networks with a very large number of layers (potentially even more than a thousand) with a high degree of accuracy. In our work, after preliminary experiments, ResNet has a small learning rate fixed to 10^-6. The network is trained with a batch size of 16 for 50 epochs.
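A sketch of one such block under the above description (Keras functional API; the filter counts are illustrative):

```python
# Residual block sketch: the block learns the residual F(x) and outputs
# F(x) + x through an identity shortcut (with a 1x1 projection when the
# input and output depths differ, as described above).
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    # F(x): two stacked 3x3 convolutions.
    f = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, (3, 3), padding="same")(f)
    # Project the shortcut with a 1x1 convolution if depths differ.
    if x.shape[-1] != filters:
        x = layers.Conv2D(filters, (1, 1), padding="same")(x)
    out = layers.Add()([f, x])   # element-by-element addition: F(x) + x
    return layers.ReLU()(out)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 128)  # illustrative filter count
model = tf.keras.Model(inputs, outputs)
```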
3.1.3 Inception network If ResNet focuses on depth, the Inception network focuses instead on width. The Inception authors were interested in the computational efficiency of training larger networks. The original work introduced a component known as the Inception module. An Inception module processes multiple different transformations of the same input data simultaneously, concatenating the results into a single output. In other words, each Inception module layer consists of a 5×5 convolution, a 3×3 convolution and a max-pooling operation; the selection of the most relevant features is left to the next layer. However, the greater information density of this architecture posed a serious problem, namely a drastic increase in the required computational capacity. Not only are the wider convolutional filters (5×5) demanding and expensive to compute, but the stacking of different filters also increases the number of feature maps per layer. This is a real impediment for the model. The authors of Inception used 1×1 convolutions to filter (or reduce) the depth of the outputs. These convolutions consider one spatial position at a time, but across multiple channels, and can extract spatial information and compress it to a smaller size. By reducing the number of input maps, the authors were able to stack different transformations in parallel, obtaining networks that were at the same time deep (with numerous layers) and wide (with parallel operations). The first version of Inception, called “GoogLeNet”, is composed of 22 layers and won the 2014 ILSVRC competition. Inception v2 and v3, developed a year later, improve on the first version in several respects: the most relevant is the refactoring of large convolutions into smaller, consecutive ones, which are easier to learn.
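A sketch of such a module (Keras functional API); the branch sizes here follow the published GoogLeNet “inception (3a)” configuration and are purely illustrative:

```python
# Inception-style module sketch: parallel 1x1, 3x3 and 5x5 convolutions
# plus max-pooling, concatenated into a single output; 1x1 convolutions
# reduce depth before the expensive 3x3 and 5x5 filters.
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_in, f3, f5_in, f5, fpool):
    b1 = layers.Conv2D(f1, (1, 1), padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3_in, (1, 1), padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, (3, 3), padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(f5_in, (1, 1), padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, (5, 5), padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D((3, 3), strides=1, padding="same")(x)
    bp = layers.Conv2D(fpool, (1, 1), padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])

inputs = tf.keras.Input(shape=(28, 28, 192))
out = inception_module(inputs, 64, 96, 128, 16, 32, 32)  # GoogLeNet 3a sizes
```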
3.2 Cultural HeRItage Sentiment (XXXXX) Dataset
First of all, we chose to classify sentiment according to the most common subdivision into 3 classes: positive, neutral and negative. The definition of sentiment has been influenced by the characteristics of the collected dataset. Normally, images are sought according to the characteristics one wants to analyze, while in this case we adapted the definitions of positive and negative to the majority of the images obtained. The next two phases were subjectivity classification and polarity classification, both carried out by manually selecting the images.
The first phase, once the dataset had been acquired, consisted in the attempt to define the “neutral” class, i.e. to separate neutral examples from those containing any kind of sentiment. Through this phase we tried to limit the problem of the researcher's subjectivity in assessing sentiment: lacking access to crowdsourcing platforms, each phase was in fact completed by a single person, and images in uncontrolled conditions, with embedded and non-iconic scenes, were not an advantage. In the second phase, positive examples were finally separated from negative examples, among those that were most common in the images. The results have been divided into the following three categories; Figure 2 shows three example pictures from the XXXXX Dataset:
• positive: selfies, tricolor arrows, fireworks, individuals and groups posing for the photo or in the act of photographing, soap bubbles, bright Christmas decorations, fresh flowers in the foreground, flags deployed or flying, large crowds (at the edge of the image), light projections on the monument, kisses, rainbows, objects that “imitate” the monument.

• negative: rain, and therefore crowds with umbrellas or reflections of the monument in puddles on the ground, snowfalls, smoke, the “head in the clouds” phenomenon for towers whose top disappears above the clouds or the fog, planes flying near the towers, images with tram wires in the foreground, scaffolding on the monument or on adjacent ones, queues of people or traffic, military parades and law enforcement in general, red cross, barriers, bottles of alcohol, dirt on the ground, demonstrations and disorderly protests. A special case for the tower of Pisa: poses that portray kicks or obscene behavior towards the tower.

• neutral: close-up fragments of the monument, large-scale images, regular life scenes around the monument, and above all the co-presence of positive and negative sentiment in the same image. Another special case was considered for the Vittoriano: the monument guards were considered negative if armed, but neutral if not armed.
Figure 2. Images from the XXXXX Dataset: (a) an example of a positive image, (b) an image with neutral sentiment, and (c) a picture with negative sentiment.
The following phase of dataset preparation concerned the process of acquiring and labeling the images. Before fixing the guidelines, it was decided to survey the types of images circulating on Instagram by downloading from profiles, hashtags and locations. To search for the most popular hashtags on the social network, the online tool keywordtool¹ was used. At first, images were downloaded via generic hashtags; then, generalizing to hashtags containing the name of the monument, many images were obtained, but with a problem related to the difficulty of finding negative examples. To overcome this problem, geotagged images were downloaded for some monuments, with better results: the advantage is that this strategy yields photos that do not carry hashtags of the monument and would otherwise be unavailable. We initially obtained a collection of more than 17,500 images considered suitable, after filtering out examples with a multiframe structure, and then started the annotation phase, carried out, as already said, first as subjectivity classification and then as polarity classification. The biggest problem in the annotation phase was working with very detailed guidelines, which led to a loss of consistency between the classes. The resulting dataset was rich but strongly unbalanced. By slightly changing the guidelines, many images moved from positive to neutral sentiment and vice versa, but the real problem remained the negative examples, which were acquired only through hashtags and keywords in the caption. At this point some considerations were made:

• Training is better with smaller and more reliable datasets than with large and ambiguous ones.

• Often the cost of missing a minority class is much greater than missing a majority class.

• Performance decreases when the task becomes too fine grained, which happens when the discriminating elements of the sentiment are too small.

• The problem of subjectivity also lies in the influence that the photo has on the human being, i.e. when it is classified not according to objective events, but by the quality and aesthetic impact of the image.

As a result, we considered it appropriate not to use all the images for training, but to cut out a more reliable dataset from the one acquired, discarding the examples considered more ambiguous. In conclusion, we filled the neutral and positive classes so as to balance the negative one. The final dataset obtained is composed as in Table 1: a total of 8477 pictures, including 2960 pictures with overall positive sentiment, 2891 pictures with overall neutral sentiment and 2626 pictures with overall negative sentiment. We perform the experiments by splitting the labeled dataset into a training set and a test set.

Positive   Neutral   Negative   Total
2960       2891      2626       8477

Table 1. Final dataset.

¹ xxxxx://xxxxxxxxxxx.xx/xxxxxxxxx
The dataset had to be pre-processed so that the images fit the input sizes of the various networks. Some images were deleted because, although viewable through the operating system, they were not correctly recognized and processed.
Three versions of the dataset were created, at 224×224, 299×299 and 331×331 pixels respectively. However, with the larger sizes, probably due to memory limits, the execution environment crashed, so all the networks were fed 224×224 pixel inputs.
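A sketch of how such resized versions might be produced; all paths and the folder layout are hypothetical, and PIL stands in for whatever tool the authors actually used:

```python
# Illustrative preparation of the three resized dataset versions.
from pathlib import Path
from PIL import Image

SIZES = [(224, 224), (299, 299), (331, 331)]

for src in Path("dataset/original").glob("*.jpg"):  # hypothetical layout
    try:
        img = Image.open(src).convert("RGB")
    except OSError:
        # Skip files the loader cannot process, mirroring the deletions
        # mentioned in the text.
        continue
    for w, h in SIZES:
        dst = Path(f"dataset/{w}x{h}") / src.name
        dst.parent.mkdir(parents=True, exist_ok=True)
        img.resize((w, h)).save(dst)
```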
3.3 Performance Evaluation Metrics
To evaluate the performance of the algorithms, the following quantities were defined (Xxxxxxxxxx et al., 2010):
- tp (True Positive): the number of images correctly classified in the class they belong to;

- tn (True Negative): the number of images correctly classified in another class;

- fn (False Negative): the number of images incorrectly classified in other classes;

- fp (False Positive): the number of images incorrectly classified in the class.
We have employed the following metrics to compare the performance of the algorithms used for the evaluation of image sentiment:

• Accuracy: indicates the effectiveness of the algorithm by showing the probability of predicting the true value of the class (positive, neutral, negative):

Accuracy = (tp + tn) / (tp + tn + fp + fn)    (1)

where tp is the number of true positives and fn the number of false negatives.
• Recall: a function of the correctly classified examples (true positives) and the incorrectly classified examples (false negatives):

recall = tp / (tp + fn)    (2)

• Precision: a function of true positives and examples incorrectly classified as positives (false positives):

precision = tp / (tp + fp)    (3)

• F1-score: a measure of a test's accuracy:

F1-score = ((β² + 1) · precision · recall) / (β² · precision + recall)    (4)

The F1-score is evenly balanced when β = 1. It favours precision when β > 1, and recall otherwise. The F1-score can be interpreted as a weighted average of precision and recall.

• Support: the number of occurrences of each class in the ground truth (correct) target values.
With the aim of schematizing the results of the model, the confusion matrix is used. The confusion matrix depicts information about actual and predicted classifications (Xxxxxxx and Xxxxxx, 1998). In the field of artificial intelligence, the confusion matrix, also called the misclassification table, returns a representation of the statistical classification accuracy. Each column of the matrix represents the predicted values, while each row represents the real values.
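These quantities can be obtained with standard tooling; a sketch with scikit-learn and hypothetical labels (not the paper's code):

```python
# Metrics sketch with scikit-learn; y_true / y_pred are hypothetical.
from sklearn.metrics import classification_report, confusion_matrix

y_true = [2, 2, 1, 0, 1, 2, 0, 0, 1, 2]  # annotated classes (ground truth)
y_pred = [2, 1, 1, 0, 1, 2, 0, 1, 1, 2]  # classes predicted by a network

# Rows are the real classes, columns the predicted ones, as described above.
print(confusion_matrix(y_true, y_pred))

# Per-class precision, recall, F1-score (beta = 1) and support.
print(classification_report(y_true, y_pred,
                            target_names=["negative", "neutral", "positive"]))
```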
4. RESULTS AND DISCUSSION
In this section, we report the results of the experimental phase conducted on the XXXXX Dataset. The dataset comprises 8477 images containing visual elements. In particular, the dataset includes “embedded scenes”, i.e. images that in addition to the object of interest (e.g. a monument) contain other objects that directly influence sentiment (e.g. happy people, waste, vandalism, and so on). The true sentiment is not automatically judged from the accompanying texts or hashtags but has been manually estimated by human annotators, thus providing a more precise dataset.
The experiments are based only on those images of the dataset for which both annotators agreed on the sentiment. By removing pictures with ambiguous sentiment, we increase the quality of the dataset and ensure the validity of the experiments. The dataset is split into 80% training and 20% test images.

Figure 3. Confusion matrix for the VGG16 network (learning rate = 0.00001, batch = 5, epochs = 100).

Figure 4. Confusion matrix for the ResNet network (learning rate = 0.000001, batch = 16, epochs = 50).

Figure 5. Confusion matrix for the InceptionResNetV2 network.

The application of our approach to this dataset yields good results in terms of precision, recall and F1-score and demonstrates the effectiveness of the proposed approach (Table 2).

DCNNs       Accuracy   Precision   Recall   F1-score
VGG16       0.65       0.67        0.65     0.66
ResNet      0.70       0.70        0.70     0.70
IncResNet   0.57       0.62        0.57     0.52

Table 2. Overall classification results comparing the three different DCNNs.
Looking at the confusion matrices in Figures 3, 4 and 5 and the final accuracy in Table 2, the first consideration is that ResNet50, with 70%, provided the best result. It must be remembered that, given its speed, more tests were conducted on this network. An interpretation could be the usefulness of the skip connections in transferring less “advanced” features to the following levels. A second consideration concerns the fact that the VGG16 network reached 65% accuracy with a more precise dataset, albeit with a smaller number of images. This confirms the above: smaller, more reliable datasets are better than larger, noisier ones. The third consideration concerns preprocessing: several examples were genuinely compromised by the crop and resize procedure, which effectively changed the sentiment. It can in fact happen that pictures with a positive sentiment were classified as neutral because the preprocessing phase filtered out the discriminating property of the sentiment. The fourth consideration concerns the most misclassified category: all three confusion matrices on the final dataset clearly show that the majority of the errors consisted in predicting neutral sentiment for a negative image; see for example Figures 3 and 5.

Generally, positive examples have individuals and objects in the foreground much more than the other two classes, while neutral ones are distinguished mainly by not having them. Finally, the negative examples are characterized by many sentiment-discriminating attributes that are small in terms of pixels (for example barriers and non-central scaffolding in the image), whose visibility is further compromised by the resizing and cropping operations, for the human eye as much as for the artificial network. Finally, we can say that overall the networks generalize quite correctly: only in one case did we see a clear example of overfitting, as Figure 6 shows. Moreover, the experiment of introducing a dropout layer into the architecture did not bring any noteworthy results.
Figure 6. Accuracy for the training and validation phases: (a) VGG16, (b) ResNet, (c) InceptionResNetV2.
5. CONCLUSIONS AND DISCUSSIONS
The promotion of Cultural Heritage (CH) goods has been an important challenge in recent years. Many aspects must be considered: cultural transmission, learning, intercultural dialogue, discussion and training, which play an important role in education (formal, informal, and life-long learning), social cohesion and sustainable development. CH goods promote economic development, notably through cultural and creative industries and tourism. A way of adding value to these services is making them capable of providing, using new technologies, a more involving and stimulating use of information.
In this paper, we introduced a deep learning approach for recognizing the sentiment of cultural heritage pictures by taking only visual information into account. The sentiment of a picture is identified by a machine learning classifier based on visual features extracted from especially trained DCNNs. The experiments on a purposely created dataset compared the performance of three DCNNs (VGG16, ResNet and InceptionResNetV2) on sentiment recognition. For this purpose an efficient contribution is provided by sentiment analysis, whose main purpose is the identification of agreement or disagreement in pictures that deal with positive or negative sentiment. For example, Instagram images also express opinions about places, monuments and territory, and so they represent a rich resource for understanding users' opinions. The experiments on the purposely created XXXXX Dataset yield high accuracies and demonstrate the effectiveness and suitability of our approach. To briefly summarize, the main contributions of this work are: (i) a demonstration that deep learning architectures can be applied to the sentiment analysis of CH-related social media pictures; (ii) a challenging new dataset of Instagram images of CH goods in the UNESCO list, hand-labelled with ground truth; (iii) a performance comparison of different DCNNs for image classification; (iv) a system that ensures the management of data with a multimedia and multidisciplinary approach through sentiment analysis techniques, to allow a CH analysis aimed at planning and distribution needs.
ACKNOWLEDGEMENTS
The authors would like to thank Xxxxx Xxxxxxxxx, whose thesis work was invaluable.
REFERENCES
Xxxxxxx, M., Xxxxxxxx, C., Xxxxx, X. and Xxxx, P., 2012. From tags to emotions: Ontology-driven sentiment analysis in the social semantic web. Intelligenza Artificiale 6(1), pp. 41–54.
Xxxxxxxx, X. X., Xxxxx, P. and Xxxx, X., 2008. An agreement measure for determining inter-annotator reliability of human judgements on affective text. In: Proceedings of the Workshop on Human Judgements in Computational Linguistics, Association for Computational Linguistics, pp. 58–65.
Xxxxxx, X., Xxx, B. and Giro-i Xxxxx, X., 2017. From pixels to sentiment: Fine-tuning cnns for visual sentiment prediction. Image and Vision Computing 65, pp. 15–22.
Xxxxxx, F., Xx Xxxxx, M., Xxxxx, S., Xxxxxxxx, M., Xxxxx, X., Xxxxxxxxxxx, X., Xxxxxxxxx, A. and Xxxxxxxx, M., 2015. How to describe cultural heritage resources in the web 2.0 era? In: 2015 11th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), IEEE, pp. 809–815.
Xxxxxxxx, X. and Xxxxxx, E., 2015. An integrated framework to assess complex cultural and natural heritage systems with multi-attribute value theory. Journal of Cultural Heritage 16(5), pp. 688–697.
Xxxxxxxx, X., 2017. Protecting cultural heritage during armed conflict: the Italian contribution to cultural peacekeeping. Modern Italy 22(1), pp. 1–17.
Xx, X., Xxxxx, X., Xxx, S. and Xxx, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.

Xxxxxx, X. and Xxxxx, S., 2015. Image sentiment analysis using deep convolutional neural networks with domain specific fine tuning. In: 2015 International Conference on Information Processing (ICIP), IEEE, pp. 447–451.

Xxxxxxxxxx, X., Xxxxxxxxxxx, C., Xxxxxxxx, X., Xxxxxxx, X. and Xxxxxxxxxx, P., 2010. Performance evaluation of automated approaches to building detection in multi-source aerial data. ISPRS Journal of Photogrammetry and Remote Sensing 65(1), pp. 123–133.

Xxxxxxxxxx, X., Xxxxxxxxx, X. and Xxxxxx, X. X., 2012. Imagenet classification with deep convolutional neural networks. pp. 1106–1114.
Xxxxxx, X., M Lerones, P., Xxxxxx, R., Xxxxxx, E. and Gómez-García-Xxxxxxx, J., 2017. Classification of architectural heritage images using deep learning techniques. Applied Sciences 7(10), pp. 992.

Xxxx, P., de Xxxxxx, M., Xxxxxxxx, G., Xxxxx, X., Xxxxxxxx, F. and Xxx, M., 2009. A semantic content-based recommender system integrating folksonomies for personalized access. In: Web Personalization in Intelligent Environments, Springer, pp. 27–47.
Xxxxxxx, X. X. et al., 2001. Landscapes, memory, monuments, and commemoration: Putting identity in its place. Canadian Ethnic Studies 33(3), pp. 39–77.
Xxxx, B., Xxx, L. et al., 2008. Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval 2(1–2), pp. 1–135.
Xxxxxxxx, M., Kaiser, C., Xxxxxxxxx, R., Xxxxxxxx, E. and Xxxxxxxxxxx, P., 2017. Visual and textual sentiment analysis of brand-related social media pictures using deep convolutional neural networks. In: International Conference on Image Analysis and Processing, Springer, pp. 402–413.

Xxxxxxxxx, R., Xxxxxxxxxx, E., Xxxxxxxxx, F., Xxxxxxxx, M., Xxxxxxxxx, X. and Xxxxxxxxxx, P., 2018. Deep convolutional neural network for automatic detection of damaged photovoltaic cells. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences.
Xxxxxxx, X. and Xxxxxx, R., 1998. Glossary of terms. Journal of Machine Learning 30(2-3), pp. 271–274.
Remondino, X. and Xxxxx, X., 2010. Reality-based 3D documentation of natural and cultural heritage sites: techniques, problems, and examples. Applied Geomatics 2(3), pp. 85–100.

Xxxxx, X., Xxxxx, X., Xxxxx, X., Xxxxx, X. X., Xxxxxx, M. and Xxxxxx, A., 2017. Image based Indian monument recognition using convoluted neural networks. In: 2017 International Conference on Big Data, IoT and Data Science (BID), IEEE, pp. 138–142.
Xxxxxxxx, X. and Xxxxxxxxx, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Xxxxxxx, C., Xxxxx, S., Xxxxxxxxx, V. and Xxxxx, X. X., 2017. Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-First AAAI Conference on Artificial Intelligence.

Xxxxxxxx, X., 2013. New values of cultural heritage and the need for a new paradigm regarding its care. In: CeROArt. Conservation, exposition, Restauration d'Objets d'Art, Association CeROArt asbl.
Xxxx, X. and Xx, B., 2015. Sentiment analysis for social media images. In: 2015 IEEE International Conference on Data Mining Workshop (ICDMW), IEEE, pp. 1584–1591.
Xxxx, X., Xxxx, S., Xxxx, X., Xxx, X. and Xx, B., 2015. Unsupervised sentiment analysis for social media images. In: Twenty-Fourth International Joint Conference on Artificial Intelligence.

Xxx, X., Xx, X. and Xxx, X., 2016. Sentiment prediction in scene images via convolutional neural networks. In: 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), IEEE, pp. 196–200.
Xxx, Q., Xxx, X., Xxx, X. and Xxxx, X., 2015. Robust image sentiment analysis using progressively trained and domain transferred deep networks. In: Twenty-Ninth AAAI Conference on Artificial Intelligence.
Xxxxx, X., Xxxx, X., Xxx, Q. and Xxx, X., 2017. When saliency meets sentiment: Understanding how image content invokes emotion and sentiment. In: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 630–634.