Methodology and Data Quality Control
Local Data Insights (LDI) does not publish raw lists. Before a dataset is published, the data passes through a cleaning, normalization, geo-validation, and digital-signal verification pipeline.
This page explains the principles of the LDI methodology, without exposing the detailed internal rules, full dictionaries, or proprietary technical logic.
What LDI Data Reflects
LDI data reflects the visible digital part of a local market, based on the publicly available information at the time of the snapshot.
LDI datasets are not official registries and do not guarantee full coverage of all organizations in a market. They provide a structured view of businesses visible in public digital sources.
ETL Process and Data Pipeline: Bronze → Silver → Gold
LDI uses an internal ETL process — data collection, transformation, and controlled publishing — built on a Bronze → Silver → Gold architecture. The data goes through multiple levels to separate raw information from clean, verifiable data ready for analysis.
Bronze
The Bronze level contains raw data observed in public digital sources. At this stage, there may be duplicates, unclear categories, incomplete addresses, or records that require further verification.
Silver
At the Silver level, the data is cleaned, normalized, and filtered: we apply exclusion rules, category dictionaries, and checks on addresses, status, location, and digital signals.
Gold
The Gold level contains data ready for publication: records that pass the quality checks, geo-validation, and consistency checks on the main fields.
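The Bronze → Silver → Gold progression can be sketched as a pair of transformations: Silver cleans and deduplicates the raw records, and Gold keeps only records that satisfy the publication checks. The field names and rules below are illustrative assumptions, not LDI's internal logic.

```python
# Minimal sketch of a Bronze -> Silver -> Gold flow.
# Record fields ("name", "category", "city") are assumptions for illustration.

def to_silver(bronze_records):
    """Clean and deduplicate raw Bronze records."""
    seen = set()
    silver = []
    for rec in bronze_records:
        name = rec.get("name", "").strip()
        key = (name.lower(), rec.get("city", "").strip().lower())
        if not name or key in seen:
            continue  # drop records with empty names and duplicates
        seen.add(key)
        silver.append({**rec, "name": name})
    return silver

def to_gold(silver_records):
    """Keep only records with all fields required for publication."""
    required = ("name", "category", "city")
    return [r for r in silver_records if all(r.get(f) for f in required)]

bronze = [
    {"name": " Cafe Aroma ", "category": "cafe", "city": "Cluj"},
    {"name": "Cafe Aroma", "category": "cafe", "city": "Cluj"},   # duplicate
    {"name": "No Category", "city": "Cluj"},                      # incomplete
]
gold = to_gold(to_silver(bronze))  # only the first record survives
```

The point of the layering is that raw observations are never overwritten: Bronze stays as collected, so Silver and Gold can be regenerated when the rules change.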
Data Cleaning, Categories, and Noise Elimination
An important part of the LDI methodology is separating relevant businesses from records that do not belong to the analyzed vertical or cannot be used reliably.
- We use category dictionaries to identify relevant businesses for each vertical.
- We exclude categories that do not belong to the analyzed market.
- We eliminate records without usable addresses when location is required for the analysis.
- We exclude businesses marked as closed or inactive in the observed sources.
- We normalize the main fields for comparisons between cities, counties, and verticals.
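The category-dictionary step above can be illustrated with a simple lookup: a record belongs to a vertical only if its normalized category appears in that vertical's dictionary. The dictionary below is a made-up example, not LDI's proprietary vocabulary.

```python
# Illustrative category dictionary; real dictionaries are proprietary
# and far larger.
CATEGORY_DICT = {
    "dental": {"dentist", "dental clinic", "orthodontist"},
}

def in_vertical(record, vertical):
    """True if the record's normalized category belongs to the vertical."""
    category = record.get("category", "").strip().lower()
    return category in CATEGORY_DICT.get(vertical, set())

records = [
    {"name": "Smile Clinic", "category": "Dentist"},
    {"name": "Auto Fix", "category": "Car repair"},
]
dental = [r for r in records if in_vertical(r, "dental")]  # keeps Smile Clinic
```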
Geo-gate and Spatial Validation
For the data to be used in local analysis and geo-analysis, LDI applies geographic checks before publishing.
Coordinates and Location
We use geographic coordinates when available and verifiable in the dataset context.
Analyzed Area
Points outside the analyzed geography are checked and can be excluded if they do not belong to the defined local market.
Usable Addresses
Records without a usable location can be eliminated from datasets where spatial analysis is important.
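The geo-gate principle can be sketched as a point-in-area test. A real pipeline would validate against proper administrative polygons; the bounding box and coordinates below are rough, illustrative assumptions.

```python
# Simplified geo-gate: keep only points inside an illustrative bounding box.
# Production spatial validation would use real polygons, not a box.

def inside_area(lat, lon, box):
    """True if (lat, lon) falls inside the (lat_min, lat_max, lon_min, lon_max) box."""
    lat_min, lat_max, lon_min, lon_max = box
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

AREA_BOX = (46.70, 46.85, 23.50, 23.70)  # rough, made-up bounds for one city

points = [
    {"name": "In town", "lat": 46.77, "lon": 23.60},
    {"name": "Far away", "lat": 44.43, "lon": 26.10},
]
kept = [p for p in points if inside_area(p["lat"], p["lon"], AREA_BOX)]
```

Points that fail the test are not silently dropped everywhere; as the text notes, they are checked and excluded only when they do not belong to the defined local market.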
Website and Digital Signal Verification
LDI treats websites and social media channels as observable digital signals, not just as simple text fields.
- We check website accessibility and remove inactive or broken links when detected.
- We separate observable websites from missing or unusable fields.
- We identify public social media pages associated with businesses, when visible.
- For Facebook, Instagram, or LinkedIn, the focus is on public pages associated with the business, not personal profiles.
- Digital signals are published as observable indicators, not as an assessment of business quality.
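The separation of observable, missing, and broken website signals can be sketched as a classification over an observed HTTP status. The fetching itself is stubbed out here; a real pipeline would issue a request (for example a HEAD request with a timeout) and record the response code.

```python
# Hedged sketch: classify a website field into an observable digital signal.
# The labels and thresholds are assumptions for illustration.

def classify_site(url, status_code):
    """Map a URL and its observed HTTP status to a signal label.

    status_code is None when the site could not be reached at all.
    """
    if not url or not url.startswith(("http://", "https://")):
        return "missing_or_unusable"
    if status_code is None:
        return "unreachable"
    if 200 <= status_code < 400:
        return "active"
    return "error"

labels = [
    classify_site("https://example.com", 200),  # active site
    classify_site("", None),                    # no website field
    classify_site("https://example.com/x", 404) # broken link
]
```

Note that the output is a descriptive label, consistent with the point above: it records what was observed, not how good the business is.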
Pre-publication Checks
Before a dataset is published, multiple internal quality checks are applied.
Q1 — Field Consistency
We check whether the main fields required for analysis are present and consistent: name, category, location, status, and identifiers.
Q2 — Location Quality
We check addresses, cities, counties, and coordinates to reduce geographical errors and points outside the analyzed area.
Q3 — Digital Signals
We check website availability, contacts, and observable digital signals before publication.
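A Q1-style field-consistency check amounts to verifying that each required field is present and non-empty. The field list below is an assumption based on the fields named in the text, not LDI's exact schema.

```python
# Sketch of a Q1 (field consistency) pre-publication check.
# REQUIRED_FIELDS is illustrative; the real required set is internal.
REQUIRED_FIELDS = ("name", "category", "city", "status")

def q1_issues(record):
    """Return the required fields that are missing or empty in a record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]

rec = {"name": "Cafe Aroma", "category": "cafe", "city": "Cluj", "status": ""}
issues = q1_issues(rec)  # flags the empty "status" field
```

Records with a non-empty issue list would be sent back for correction rather than published.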
What Ends Up in the Published Dataset
The published dataset does not include all the records observed initially. At the Gold level, only records that pass the relevance, location, and consistency filters are included.
- Relevant businesses for the analyzed vertical.
- Records with usable location for the defined market.
- Businesses that are not marked as closed in the observed sources.
- Public contact data and digital channels available, where they exist.
- Normalized fields for working in CSV and XLSX formats.
Formats and Usage
LDI commercial datasets are prepared for use in analysis tools, spreadsheets, and operational workflows.
CSV
A format suitable for import, analysis, automation, and BI tools. Files are saved in UTF-8 encoding.
XLSX
A format suitable for direct work in Excel, including fields where leading zeros need to be preserved.
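The leading-zeros point matters in practice: a postal code or phone number loaded as a number silently loses its leading zero, while loading it as text preserves it. The file content and column name below are illustrative.

```python
# Why fields with leading zeros must be treated as text, not numbers.
import csv
import io

data = "name,postal_code\nCafe Aroma,0400123\n"      # illustrative CSV content
rows = list(csv.DictReader(io.StringIO(data)))

postal = rows[0]["postal_code"]  # read as text: "0400123", zero preserved
as_number = int(postal)          # read as a number: 400123, zero lost
```

This is why the XLSX files keep such columns formatted as text, and why CSV consumers should import them as string columns.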
Local Analysis
The location fields and coordinates, when available, allow the data to be used for maps and spatial exploration.
What LDI Is Not
For a correct interpretation, it is important to clarify what these datasets do not represent.
Not an Official Registry
The data does not replace official company or institution registries.
Not a Quality Assessment
Digital signals do not show whether a business is good, poor, or recommended.
Does Not Guarantee Full Coverage
The datasets reflect the visible market in public digital sources, not the absolute entirety of the market.
Limitations and Responsibility
LDI data reflects information available in public digital sources at the time of the snapshot. It is not an official registry and does not guarantee full coverage of all organizations in a market. Public sources may change over time, and the use of the data must comply with applicable legislation.
Explore the Data
Check out the published datasets, view samples, or request a custom dataset for a vertical, city, or county that interests you.