Unstructured Data is not what you think it is!

Written by Ronald Baan

Ronald is a data enthusiast who spends his time sharing his passion for data with others.

25 April 2022

You come across this picture, or something with the same message, in all sorts of places. It seems logical, but a world of misunderstanding and misguided approaches lies behind it.

  1. The definition of Unstructured Data is “No pre-defined data model.”
    Nothing could be further from the truth. By unstructured data, people usually mean documents, video, social media and websites. Yet documents, even those 500 years old, can still be read and understood by us because they have fairly stable semantics and language models. Video is described in multiple international standards and carries metadata. And Twitter’s API certainly provides a pre-defined data model.
  2. The definition of Structured Data is “Well-defined, easily organized data (in databases).”
    If that were the case, I would be out of work, and we would already have easily exchanged models describing our datasets. “Easily organized” is very funny, but not really truthful!
  3. The amount of data is increasing exponentially, especially that of “unstructured data.” Yet the focus remains on structured data only. Something to think about.
  4. Humor really is everywhere. Take Informatica, a well-known data management company. In their words, “Unstructured data is non-transactional business data, the format of which cannot, or does not, easily conform to a relational database schema.” – https://lnkd.in/eeEWAQkG
    • “Non-transactional?”
      I think many transactions of interest to organizations are contained in documents such as agreements, to name just one example.
    • “cannot, or does not, easily conform to a relational database schema”
      Ah, here it is: it doesn’t fit into a database, so basically we don’t know what to do with it and just call it unstructured.

Data management where the data is in databases is relatively easy, but covers only 20% of all data. Let’s look at, manage and use data in all its forms and facets. If it’s easy, it’s less fun, right? We can do this!

In the #DAMA #DMBoK, we have a knowledge area on document and content management. There you can find good things to help with this beautiful form of data.

#DAMA also sees the problems with the term “unstructured data.” On page 322 of #DMBoK2:
1.3.10 Unstructured Data.
It is estimated that as much as 80% of all stored data is maintained outside of relational databases. This unstructured data does not have a data model that enables users to understand its content or how it is organized; it is not tagged or structured into rows and columns. The term unstructured is somewhat misleading, as there often is structure in documents, graphics, and other formats, for instance, chapters or headers. Some refer to data stored outside relational databases as non-tabular or semi-structured data. No single term adequately describes the vast volume and diverse format of electronic information that is created and stored in today’s world.
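
The point about chapters and headers can be made concrete with a minimal sketch. It assumes a hypothetical plain-text document whose headings are marked with leading `#` characters; the sample text, the `HEADING` pattern and the `outline` function are illustrative names, not part of the DMBoK or of any particular tool. The idea is only that a so-called unstructured document gives up an outline with very little effort.

```python
import re

# Hypothetical sample document; the content and the "#" heading style
# are assumptions made for illustration only.
DOCUMENT = """\
# Agreement
Parties and date.

## 1. Scope
The supplier delivers the goods.

## 2. Payment
Payment is due within 30 days.
"""

# One or more "#" characters, whitespace, then the heading title.
HEADING = re.compile(r"^(#+)\s+(.*)$")


def outline(text):
    """Return a (level, title) pair for each heading line found in the text."""
    result = []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            result.append((len(match.group(1)), match.group(2).strip()))
    return result


# Print the recovered outline, indented by heading level.
for level, title in outline(DOCUMENT):
    print("  " * (level - 1) + title)
```

Real documents use many other conventions (numbered clauses, styles in a word processor, HTML tags), so a pattern like this would need to be adapted per format. The structure is there either way; it is just not expressed in rows and columns.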
