Data Quality & General Methodologies

In the age of digital transformation, the significance of data in understanding market dynamics, customer preferences, and business opportunities cannot be overstated. Recognising this, The Data Appeal Company (TDAC) stands at the forefront of data analytics, harnessing the power of advanced technologies to provide unparalleled insights into points of interest (POIs) and their digital presence. Our comprehensive methodologies, rooted in cutting-edge machine learning and data science, are designed to navigate the complexities of digital data, ensuring accuracy, reliability, and actionable intelligence. This document delineates the core strategies and innovative approaches we employ to transform raw data into a strategic asset for businesses across the globe.

💾 Data collection

The Data Appeal Company (TDAC) collects and monitors the digital presence of points of interest (POIs) in selected areas. This is done by analysing portals, websites, OTAs, and social media channels.

For each property detected, TDAC explores its type, category, services offered, and the content of the digital customer experience (such as reviews, ratings, and scores), and also analyses the origin, language, and type of guests/customers.

For hospitality accommodations, TDAC uses an online property discovery methodology, which involves recognising accommodation properties across the various channels. Each hotel or non-hotel property is considered unique, even if it is present on multiple review channels, OTAs, or social media channels.

This process is automated and uses an algorithm designed to maximise the probability that the explored channels are related to the same property.

The discovery algorithm consists of several steps:

  • Identifying a benchmark channel based on the type of property to start the discovery process.
  • Identifying a set of discriminatory variables (both quantitative and qualitative) present on various channels. These variables include geo-location, descriptions, personal information, and reviews.
  • Conducting a web search to find a comparison sample for each channel and identifying the same variables.
  • Creating specific metrics for comparing the data, depending on whether it is quantitative, qualitative, or textual.
  • Applying a classification model to calculate the probability that a given channel refers to the same property as the benchmark channel.
  • Collecting information from all channels using the previous algorithm.

This process also makes it possible to complete partial information on individual channels by aggregating all the collected data and building the digital identity of the benchmark property.
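
As a simplified illustration, the matching step might look like the sketch below. The features, the use of scikit-learn's logistic regression, and all names are assumptions for illustration, not TDAC's production implementation.

```python
# Hypothetical sketch of the channel-matching step: build comparison
# features between a benchmark listing and a candidate listing, then
# score the probability that both refer to the same property.
from dataclasses import dataclass
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

from sklearn.linear_model import LogisticRegression

@dataclass
class Listing:
    name: str
    lat: float
    lon: float
    description: str

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def text_similarity(a: str, b: str) -> float:
    """Cheap string similarity in [0, 1]; richer NLP features would be used in practice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(benchmark: Listing, candidate: Listing) -> list:
    """Discriminatory variables: geo-location, names, descriptions."""
    return [
        haversine_km(benchmark.lat, benchmark.lon, candidate.lat, candidate.lon),
        text_similarity(benchmark.name, candidate.name),
        text_similarity(benchmark.description, candidate.description),
    ]

# Trained on manually verified pairs: 1 = same property, 0 = different.
matcher = LogisticRegression()
# matcher.fit(X_train, y_train)
# p_same = matcher.predict_proba([pair_features(benchmark, candidate)])[0, 1]
```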

☑️ Data Quality

Given the enormous amount of data that is collected from the web on a daily basis and stored in our systems, it is crucial to activate procedures that can identify and, if possible, correct outliers, anomalies, and potential inconsistencies.

The process of ensuring data quality must be automated and involve the use of fast and accurate algorithms. These algorithms should be designed specifically for the problem at hand.

Validation of locations

In the context of location intelligence, it is vital to correctly geolocate the data.

Based on the GPS coordinates, we are able to "isolate" the points of interest (POIs) to be verified. The selection is triggered by several alerts within the data:

  • Inconsistencies in personal information
  • POIs falling in the outer percentiles of the distribution within a given area (the "extremes")
  • Recently retrieved POIs that have not yet been validated

The validation process involves making API calls to specific providers and comparing the results to verify their consistency. Additionally, a K-nearest neighbours (KNN) algorithm is used to classify POIs based on the characteristics of the nearby POIs.
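
A minimal sketch of the KNN check, assuming scikit-learn and plain latitude/longitude features; a production system would use geodesic distances and richer attributes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Coordinates (lat, lon) and categories of already-validated POIs.
validated_coords = np.array([[45.460, 9.190], [45.470, 9.180],
                             [45.460, 9.200], [45.465, 9.195]])
validated_types = np.array(["hotel", "hotel", "hotel", "restaurant"])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(validated_coords, validated_types)

# A freshly retrieved POI declared as a "museum": if its neighbourhood
# disagrees, raise an alert instead of silently accepting the web data.
declared = "museum"
predicted = knn.predict([[45.462, 9.191]])[0]
if predicted != declared:
    print(f"Alert: declared '{declared}', neighbourhood suggests '{predicted}'")
```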

Normalisation of locations

A useful outcome of the validation process is the normalisation of administrative hierarchies, such as state, region, province, and municipality, across different countries. Thanks to dedicated per-country analyses, our solution can project different administrative configurations onto four comparable cross-country levels: country, state, county, and city.

The ability to make coherent comparisons between different countries while maintaining the same location intelligence logic is one of the unique aspects of the TDAC offer, since no such standard is currently available on the web.
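
As a hedged illustration, the projection onto the four canonical levels could be driven by a per-country mapping along these lines; the field names are hypothetical.

```python
# country_code -> source field holding each of the four canonical levels.
LEVEL_MAP = {
    "IT": ("country", "regione", "provincia", "comune"),
    "US": ("country", "state", "county", "city"),
    "FR": ("country", "région", "département", "commune"),
}

def normalise(country_code: str, record: dict) -> dict:
    """Project a raw record onto the canonical country/state/county/city levels."""
    source_fields = LEVEL_MAP[country_code]
    return dict(zip(("country", "state", "county", "city"),
                    (record.get(field) for field in source_fields)))

print(normalise("IT", {"country": "Italy", "regione": "Sicilia",
                       "provincia": "Palermo", "comune": "Palermo"}))
# {'country': 'Italy', 'state': 'Sicilia', 'county': 'Palermo', 'city': 'Palermo'}
```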

Duplicate check

The risk of having duplicates among millions of POIs is very high. To manage it, ad hoc checks are performed on a set of "sentinel" fields and on the connected web channels. A classification algorithm identifies the POIs that require thorough verification or that should be discarded as duplicates.

The same applies to the URLs of the monitored channels. Our classification processes help minimise the presence of abnormal or duplicate URLs.
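
A minimal sketch of such a check, assuming name, coordinates, and channel URLs as the sentinel fields; the thresholds and field names are illustrative, not TDAC's actual rules.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def is_duplicate_candidate(a: dict, b: dict) -> bool:
    """Flag POI pairs whose sentinel fields point at the same real-world place."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    shared_url = bool(set(a["urls"]) & set(b["urls"]))
    nearby = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) < 0.1
    return shared_url or (name_sim > 0.9 and nearby)
```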

Correction of personal information

For the purposes of data analysis, it is crucial to classify POIs (Points of Interest) correctly. Distinguishing between a hotel and a B&B, for instance, impacts subsequent analyses.

This data is typically obtained from various web channels, but its reliability is not always guaranteed. To address this, an algorithm has been developed to compare the information from different channels. This algorithm utilises benchmark data to classify specific types of POIs or properties.

The algorithm provides a confidence score, allowing us to:

  • Identify a set of POIs that need to be manually verified and temporarily marked as invalid.
  • Correct the information of certain POIs by overriding the data retrieved from the web channels.
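
For illustration, a hedged sketch of how such a confidence score might drive these two outcomes; the thresholds and field names are hypothetical.

```python
def route(poi: dict, confidence: float, low: float = 0.4, high: float = 0.9) -> str:
    """Decide what to do with a POI given the classifier's confidence."""
    if confidence >= high:
        # Trust the benchmark data and override the web-channel information.
        poi["category"] = poi["benchmark_category"]
        return "corrected"
    if confidence <= low:
        # Park the POI for manual verification; mark it invalid meanwhile.
        poi["valid"] = False
        return "manual_review"
    return "unchanged"
```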

📊 Semantic Analysis

Texts are our Big Data, so extracting valuable information about the sentiment and perception expressed by individuals is one of the most important outcomes for our company.

Our ambitious goal is not only to provide a polarity score (sentiment) for each piece of content, but also to identify the main topics, the subjects (aspects), and the judgments (opinions) connected to these topics.

The algorithm is composed of three models:

  1. A Named Entity Recognition (NER) model that classifies words or phrases into predefined categories based on unstructured text such as reviews. There is a specific model for each language. Although it is a supervised model requiring training on previously tagged texts, the strength of this algorithm lies in its ability to generalise, i.e. to "learn" and classify words and phrases that are not part of the training sample. The model has been trained to maximise precision, a metric that ensures the reliability of the output results in terms of quality.
  2. A model that connects opinions to the aspects they refer to. This is achieved using a Dependency Parser, which analyses the grammatical structure of a text and builds a tree of connections between words. The logic of dependency parsing varies between languages, but there are similarities in the connections across different languages (e.g. Italian and English).
  3. The Sentiment Analysis model is a classic machine learning model in the field of Natural Language Processing (NLP). It searches for non-linear dependencies between words to computationally understand the logic representing satisfaction and overall polarity in a text. This supervised model, typically a neural network, uses an embedding layer to numerically represent the words in a given dictionary. Like the previous models, it is specific to the language. The strength of this algorithm is that it can provide a multi-class score (positive, negative, and neutral) for any text without requiring specific opinions.

The result is the ability to read and analyse natural language in its original form, identifying its logic and emotional tone.
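
A minimal sketch of the three stages, assuming spaCy as the toolkit (the document does not name TDAC's actual stack) and a toy lexicon standing in for the neural sentiment model.

```python
# python -m spacy download en_core_web_sm  (one model per language)
import spacy

nlp = spacy.load("en_core_web_sm")

# Toy stand-in for the neural sentiment model described above.
TOY_LEXICON = {"spotless": 1, "friendly": 1, "noisy": -1, "dirty": -1}

def aspects_and_opinions(text: str):
    """Link opinion words to the aspect nouns they modify via the dependency tree."""
    pairs = []
    for token in nlp(text):
        # An adjectival modifier ("amod") attaches an opinion to an aspect noun.
        if token.dep_ == "amod" and token.head.pos_ == "NOUN":
            pairs.append((token.head.text, token.text,
                          TOY_LEXICON.get(token.lemma_.lower(), 0)))
    return pairs

print(aspects_and_opinions("The spotless room made up for the noisy street."))
# Expected: [('room', 'spotless', 1), ('street', 'noisy', -1)]
```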

📉 Cluster Analysis

Exploratory analyses on large amounts of data require a fully data-driven approach. One unsupervised machine learning technique that can be used is clustering. There are various models available for clustering, depending on the specific problem being analysed.

The goal of clustering is to divide a set of objects into N groups, or clusters, based on a specific set of selected variables. These clusters are characterised by the fact that the individuals (i.e. POIs) included in each cluster are homogeneous, or similar to each other, so as to minimise the total intra-cluster variance.

This type of analysis has numerous applications and is particularly interesting because it allows us to group POIs based on any variable we want to consider.
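
A minimal k-means sketch, assuming scikit-learn and illustrative POI features (rating, review volume, price level); the actual choice of model and variables depends on the problem being analysed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows: POIs; columns: average rating, number of reviews, price level.
X = np.array([[4.5, 1200, 3], [4.6, 900, 3], [3.1, 40, 1],
              [2.9, 55, 1], [4.8, 2000, 4], [3.0, 30, 1]])

# Standardise so no variable dominates the distance metric, then
# minimise total intra-cluster variance with k-means.
X_std = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print(labels)  # e.g. premium high-volume POIs vs. the rest
```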

🔦 Path seeker

Reviews provide the texts on which numerous analyses are based, but they also offer opportunities for deeper analysis. Since reviews are dated and refer to specific points of interest (POIs), they allow us to understand the movement of people and identify recurring patterns in the experiences of travellers or individuals in general.

We have implemented an algorithm that, given a selected area of interest, provides detailed information on the most frequent visit patterns, indicates their direction (where visitors have been before and after), and enriches the analysis with other relevant information about the type of visitor.
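
As a hedged illustration, the core of such a pattern search can be sketched by rebuilding per-reviewer visit sequences and counting directed transitions; the data and names below are invented.

```python
from collections import Counter

# (reviewer_id, review_date, poi) triples parsed from dated reviews.
reviews = [
    ("u1", "2024-05-01", "Duomo"),    ("u1", "2024-05-02", "Navigli"),
    ("u2", "2024-06-10", "Duomo"),    ("u2", "2024-06-11", "Navigli"),
    ("u3", "2024-07-03", "Galleria"), ("u3", "2024-07-04", "Duomo"),
]

# Rebuild each reviewer's visit sequence in chronological order.
paths = {}
for user, _date, poi in sorted(reviews):
    paths.setdefault(user, []).append(poi)

# Count directed POI -> POI transitions across all sequences.
transitions = Counter()
for path in paths.values():
    transitions.update(zip(path, path[1:]))

print(transitions.most_common(1))
# [(('Duomo', 'Navigli'), 2)] -> the dominant "before -> after" pattern
```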

🪡 Custom Semantic

Always following a data-driven approach, it is often necessary to carry out customised analyses based on customer preferences or specific requests. In these cases, we go beyond the standard semantic analysis approach, which only allows for the analysis of predefined topics.

The first step is to identify the scope of the analysis in terms of space and time (specific location and time period to be analysed). Since the analysis is customised, the topic can be quite abstract. For example, it could involve assessing the cleanliness of Sicilian beaches, evaluating craft activities in Venice, exploring activities for children in Florence, or understanding the perception of pay TV in restaurants or hotels.

The algorithm used relies on a technique called Word2Vec, which represents words as multidimensional numeric vectors. These vectors possess an important characteristic: words that are used in similar contexts (e.g. "hotel" and "lodging") have similar vectors, making them "neighbours" in a reference vector space. This proximity allows the analysis of a specific topic to be enriched with terms and sentences that cannot be predefined, but rather emerge directly from the texts or contexts. For example, during an analysis on beaches, the algorithm suggests (in an interactive approach with the analyst) including terms such as "beach," "coast," and "bay," which expand the scope of the analysis.
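
A minimal sketch of this expansion step, assuming gensim's Word2Vec and a toy corpus; production models are trained on millions of review texts.

```python
from gensim.models import Word2Vec

# Toy tokenised corpus; real corpora are millions of review sentences.
corpus = [
    ["the", "beach", "was", "clean"],
    ["lovely", "coast", "and", "clean", "sand"],
    ["the", "bay", "near", "the", "beach", "was", "quiet"],
    ["dirty", "sand", "along", "the", "coast"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=200, seed=1)

# Candidate expansions the analyst can accept or reject interactively:
for term, score in model.wv.most_similar("beach", topn=3):
    print(f"{term}: {score:.2f}")
```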

The ultimate goal is to identify all the phrases related to the topic of analysis. Once this is done, it is possible to delve deeper and:

  • Identify key phrases and relevant topics (keeping in mind that they are calculated based on sentences related to the analysis topic and are therefore contextualised).
  • Determine the sentiment of the sentences and, consequently, the sentiment associated with the analysed topic.
  • Identify the opinions expressed in these sentences and connect them to the relevant themes mentioned in the previous point.
  • Geolocate the points of interest (POIs) where the topic is being discussed, along with the sentiment associated with these POIs.
  • Analyse the topic from a temporal perspective (e.g. when it is being discussed the most).
  • Summarise everything through specific visuals and a dedicated dashboard.
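
For illustration, the geographic and temporal outputs in the last bullets might be produced by aggregations along these lines, assuming pandas and hypothetical column names.

```python
import pandas as pd

# Sentences already matched to the analysis topic (columns are hypothetical).
matched = pd.DataFrame({
    "poi": ["Mondello Beach", "Mondello Beach", "San Vito Beach"],
    "date": pd.to_datetime(["2024-07-02", "2024-08-15", "2024-07-20"]),
    "sentiment": [0.8, -0.2, 0.6],   # per-sentence polarity in [-1, 1]
})

# Sentiment per POI (feeds the map) and per month (feeds the time series).
by_poi = matched.groupby("poi")["sentiment"].mean()
by_month = matched.groupby(matched["date"].dt.to_period("M"))["sentiment"].mean()
print(by_poi, by_month, sep="\n\n")
```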

🖇️ Topic Analysis

Semantic analysis allows us to proactively define the topics we want to monitor and connect related opinions and aspects to these topics. However, there are cases where we need to conduct exploratory analyses without predetermined context or emerging topics.

An unsupervised exploratory technique in the field of Natural Language Processing (NLP) is topic analysis. It helps identify abstract topics within the text and group contents based on these topics. One disadvantage of this technique is that the abstract topics that emerge from the analysis can overlap and be difficult to understand. For example, a topic may combine factors such as "food" and "education," which are distinct factors but may converge in the model.

At TDAC, we have developed a semi-supervised algorithm inspired by this technique. It allows us to control and make the extracted topics more understandable:

  • We search within a corpus (a set of texts) for the most discussed words or phrases to focus on the most relevant topics (referred to as N-Grams in NLP terminology).
  • We map each word/topic from the previous point to a multidimensional vector (word embedding) using an auto-encoder neural network commonly used in NLP (Word2Vec). This technique ensures that words used in similar contexts (e.g., hotel and lodging) have similar multidimensional vectors.
  • To graphically represent these vectors, we reduce the dimensionality using a non-linear technique widely used in NLP called t-distributed stochastic neighbour embedding (t-SNE).
  • We visually project these themes (or rather, their reduced-dimensional vectors) and utilise cluster analysis to identify relevant groups. These groups represent the most popular themes discussed within the analysis.
  • Each cluster represents a topic, ensuring consistency within the topic.
  • We then project each sentence within our corpus onto these topics. This is achieved using similarity metrics (cosine similarity) in multidimensional vector spaces (our embedding).
  • Once we identify the phrases for each topic, we can provide relevant insights. Additionally, we can apply a sentiment analysis algorithm (as mentioned in point 3, Semantic Analysis) to generate a sentiment score for each identified topic.
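
A condensed, hedged sketch of this pipeline: toy vectors stand in for a trained Word2Vec model, and the n-gram mining and dashboard layers are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
ngrams = ["hotel", "lodging", "room", "food", "dinner", "breakfast"]
# Toy embeddings with two latent themes (accommodation vs. dining);
# in production these come from a Word2Vec model trained on the corpus.
theme = np.array([0, 0, 0, 1, 1, 1])
vectors = np.eye(2)[theme] @ rng.normal(size=(2, 50)) + 0.05 * rng.normal(size=(6, 50))

# Reduce to 2-D with t-SNE for the visual projection, then cluster the
# projected points to identify the topic groups.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)

# Topic centroids back in the full embedding space, used to project each
# sentence onto a topic via cosine similarity.
centroids = np.array([vectors[labels == k].mean(axis=0) for k in range(2)])

def sentence_topic(tokens):
    """Assign a sentence to the most similar topic centroid."""
    known = [vectors[ngrams.index(t)] for t in tokens if t in ngrams]
    sent_vec = np.mean(known, axis=0, keepdims=True)
    return int(cosine_similarity(sent_vec, centroids).argmax())

print(sentence_topic(["breakfast", "and", "food"]))
# Index of the dining-themed topic (cluster label order is arbitrary).
```
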
In conclusion, TDAC's meticulous approach to data collection, quality assurance, semantic analysis, and advanced analytics represents a paradigm shift in how digital data is leveraged for strategic decision-making. By applying our sophisticated algorithms and data-driven methodologies, we empower organisations to glean deeper insights, identify emerging trends, and make informed decisions that drive growth and innovation. As we continue to evolve and refine our techniques, TDAC remains committed to setting new standards in data analytics, ensuring our clients stay ahead in an increasingly competitive and data-centric world.