What Is Dark Data? The Basics & The Challenges

Dark data and unstructured data describe roughly the same thing. The difference lies in whom the term is directed at. Unstructured data tends to be a term aimed at engineers. It refers to the structural qualities of the data, signaling to the engineer how they'll have to go about refining the data to make any use of it.

Unstructured data is unrefined data, requiring more work to make it usable; structured data is already refined data where the data’s purpose is already determined. Unstructured data is the yin to structured data’s yang, but, mostly, unstructured data comes from an engineering-centric point of view.

What is dark data?

Dark data, however, emerges from the user-centric point of view. Where structured and unstructured data refer to the structural qualities of the data, dark data refers to the visible qualities of the data. There is data the user can see, like Instagram photos, profile names, and hashtags, and then there is data the user cannot see: the dark data.

On a social media platform like Instagram, the dark data would be:

  • How many login instances does the user have?
  • Does their user activity cluster around certain times of the day?
  • How many people liked the post who have large networks of users? (To measure a user’s clout.)
  • From where was the photo taken?
  • Where was the person when they posted the photo?

People can get overwhelmed by seeing so much data. Standard design practice says Keep It Simple Stupid (KISS) and holds white space as its central virtue. Instagram even decreased the amount of data it showed by generalizing the number of likes a photo would get from a very specific 134,392 to simply saying, “Thousands”.

When the users are the engineers themselves, dark data refers to unstructured data that never gets analyzed. It's the data stored through various network processes on servers and in data lakes, sitting around to satisfy industry retention requirements or kept simply because data storage is so cheap.

Dark data examples

The types of dark data that exist are industry specific. Background weather data might be collected in a running app, and browser history might be collected in a shopping app.

Basically, anything sent over the internet has the potential to be, and to create, dark data. Packets travel from point A to point B. While those packets can be encrypted, making it hard for anyone looking in to see what is inside them, other parts of the exchange are still known entities.

Types of dark data include:

  • Log files (servers, systems, architecture, etc.)
  • Previous employee data
  • Financial statements
  • Geolocation data
  • Raw survey data
  • Surveillance video footage
  • Customer call records
  • Email correspondences
  • Notes, presentations, or old documents

How much data is dark data?

In order to make software services work, some data must be collected. An IP address must be known to fetch data from somewhere else on the network and return it to a user somewhere else on the network. Artificial intelligence-backed services are showing that the more data a company has on a user, the better the service it can provide.

IDC estimates that 90% of data is unstructured. AI is helping make more use of this unstructured data, which should bring that number down, but it is so much easier to collect unstructured data than it is to build machine learning models that actually do something with it that the percentage is more likely to keep climbing. In just a few years, dark data could make up 95-97% of all data, and if the trend continues, it could reasonably exceed 99%.

The number is neither good nor bad. Having 99.9999% of all data in the world be dark data means little. It just means there sits a lot of unused data. If anything, that number should signal there might be a great opportunity to turn data into something no one else has.

What are the risks of dark data?

Collecting and storing vast quantities of data that you don’t need and don’t use is not harmless. Dark data opens your organization up to risks and costs.

  • Data security risks: Because dark data isn’t used, the out-of-sight, out-of-mind mentality can take over. All too often, little thought goes into storing and handling it. Dark data can contain sensitive information that is at risk as a result of such lax data security.
  • Compliance and regulatory risks: Collecting data you don’t need, particularly when it contains personal or identifiable information, and not putting it under proper protections, can lead to non-compliance with data protection regulations like GDPR or CCPA.
  • Operational costs: Collecting, storing, and maintaining data is not cheap. Consider the impact on your IT infrastructure. With respect to dark data, you are not getting value to offset the cost.
  • Missed opportunities: Dark data may contain valuable insights. If you never analyze it, you may miss the chance to uncover trends, boost efficiency, or find ways to generate additional revenue.

How to handle dark data

Privacy with dark data

People are creating their technological footprint with data. This is fine when people don't mind if others know where they've been walking, but sometimes there are other items—medical queries, Google searches, less savory sites, and even information you need to hide from a partner or relative—that individuals don't want others to see.

When it comes to data, security is very challenging.

Challenge 1: Anonymous data

People often think the first step to securing data is to anonymize the data. This means that all the data points can exist, but they’ll remove any account numbers, names, email addresses, etc., from the person’s data so it can’t identify them directly. That method worked in elementary school, when a name was removed from an assignment someone turned in, and it could work for someone like Frank Abagnale as he put new names on checks and diplomas to parade around the country as an airline pilot, doctor, and lawyer.

But data in the technological world works differently. Any set of data points is an identifier. Five data points linked to one person, regardless of the name being given, are an identifier. If someone is known to wake in the morning, go for a walk, sneeze, yawn, kick a rock, go back to sleep, that is an impression of a single identity stamped upon the world.

Challenge 2: Intersections of data

There is so much data out there that a person's name can exist in another dataset entirely. When those datasets have data points that intersect and the two sets are cross-referenced, it becomes possible to place an identity upon the anonymized data. Creating a Venn diagram of different data sources and finding where they overlap is the simplest option, and statistics offers more sophisticated methods to deanonymize data.

There's the story of a legal case where an elderly woman was hit by a car that then drove off. The woman was able to say that the car was yellow (she didn't know the make) and that the driver was a brown man with dark hair. That is not a lot to go on, but a few more dark data points add the time of day and the location of the accident. From these four data points, in a town of about 120,000, investigators were able to narrow their search from seemingly impossible odds down to only a few suspects who could have hit the woman.

Similarly, in the technology world, the 7scientists research team presented a comparable case at DEF CON. They purchased anonymized browsing data, which is easy to buy, and showed they were able to identify a specific user from it based on just five data points.

Their presentation charted how many possible users the browsing data could belong to after each known data point was added; the candidate pool shrinks with every new point.
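
To make that narrowing effect concrete, here is a minimal Python sketch. The candidates, attributes, and "known facts" are invented for illustration; they are not taken from the DEF CON research. Each known data point acts as a filter, and the pool of possible identities shrinks with every one:

candidates = [
    {"name": "user_a", "city": "Berlin", "employer": "HospitalX", "visits_knitting_forum": False},
    {"name": "user_b", "city": "Berlin", "employer": "HospitalX", "visits_knitting_forum": True},
    {"name": "user_c", "city": "Berlin", "employer": "BankY", "visits_knitting_forum": True},
    {"name": "user_d", "city": "Munich", "employer": "HospitalX", "visits_knitting_forum": True},
]

# Data points observed in the "anonymous" browsing history
known_facts = [
    ("city", "Berlin"),                # geolocation of most sessions
    ("employer", "HospitalX"),         # an intranet URL seen in the history
    ("visits_knitting_forum", True),   # a niche site that few people visit
]

pool = candidates
for field, value in known_facts:
    pool = [person for person in pool if person[field] == value]
    print(f"After matching {field}={value!r}: {len(pool)} possible user(s)")

if len(pool) == 1:
    print("Re-identified:", pool[0]["name"])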

Open source data privacy

Open Mined is an open-source research group working to make data handling more privacy-preserving. In a world with more and more dark data, its work helps keep data anonymous and identities private even as the amount of available data grows.

Specifically, machine learning models are trained on data, and they can both offer high value and depend on sensitive data to work. While all data can be considered sensitive and treated with equal care, legal frameworks put medical records among the most sensitive of all.

Training machine learning models on people's medical histories is therefore inherently difficult because of how carefully the industry has treated those records in the past. Challenges include not having enough data, data being isolated in different locations for security purposes, and having to jump through many extra hoops to meet the safety practices required by regulatory institutions.

The goal of Open Mined is two-fold: to create a framework where people get paid for their data and to truly anonymize data when passed through ML models. To that end, the open-source group currently offers three major software solutions:

  • Encrypted Machine Learning as a Service
  • Privacy Preserving Data Science Platform
  • Federated Learning

Security isn’t privacy

There is a lot of dark data out there, and there will likely be more. Security practices, as they stand, do not preserve privacy across all those dark data points, but research groups are out there successfully improving the data landscape, improving people's privacy, and advocating for people to get paid for the data they create.

Dark data management

Given the issues and opportunities that come with unused and possibly forgotten dark data, developing a formal process for managing it makes sense. You can reduce the liabilities and risks while unlocking benefits.

Managing dark data can uncover insights that can improve your operations, strengthen customer experiences and loyalty, and lead to innovations and new revenue streams.

Managing dark data can also reduce costs and risks. You can reduce its impact on your IT infrastructure and the costs to collect and store it. You can also mitigate data privacy and data security compliance issues.

Some best practices for dark data management include:

  • Starting with a data audit that assesses the volume of unused data in your systems (a minimal audit sketch follows this list).
  • Implementing a data classification system that identifies valuable, sensitive, obsolete, and unneeded data.
  • Creating a system for making deletion and retention decisions, and specifying proper handling of retained data.
  • Ensuring security protocols like encryption, access control, and data lifecycle policies apply to dark data as well as data in regular use.
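
As a starting point for the audit and classification steps above, here is a minimal Python sketch that walks a storage location and flags stale or potentially sensitive files as dark data candidates. The root path, age threshold, and keyword list are assumptions made for illustration, not part of any particular product:

import os
import time

ROOT = "/data/archive"          # assumed storage location to audit
STALE_DAYS = 365                # files untouched for a year become "dark data" candidates
SENSITIVE_HINTS = ("ssn", "payroll", "medical", "customer")  # illustrative keywords

now = time.time()
report = {"stale": [], "possibly_sensitive": []}

for dirpath, _, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            last_access = os.path.getatime(path)
        except OSError:
            continue  # unreadable file; track separately if needed
        if (now - last_access) > STALE_DAYS * 86400:
            report["stale"].append(path)
        if any(hint in name.lower() for hint in SENSITIVE_HINTS):
            report["possibly_sensitive"].append(path)

print(f"Stale files (dark data candidates): {len(report['stale'])}")
print(f"Files needing classification review: {len(report['possibly_sensitive'])}")
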
Learn more about the enterprise BMC AMI solution for data management

Dark analytics

Processing and using dark data to uncover what insights may be locked away in it, and then using those insights to make decisions, is the core of dark data analytics. The vast quantity of unstructured, unanalyzed, and forgotten data may be a gold mine for your organization.

Typical dark data includes log files, sensor data, archived emails, social media interactions, customer call recordings, service records, customer feedback, and more. You may find patterns, like recurring customer complaints that point out a product issue, or uncover a trend that points to an emerging customer need or cybersecurity breach.

Analyzing dark data can lead to a competitive advantage, a growth opportunity, and the mitigation of a previously unseen risk before it becomes a serious problem.

Benefits of dark analytics

Using dark analytics is a hidden superpower for your organization. You can mitigate risks, lower costs, gain a competitive advantage, and make smarter decisions faster with good dark data analytics. Here are some key ways to benefit:

  • Uncover obstructions, inefficiencies, and slowdowns in processes and operations
  • Find recurring service issues
  • Improve resource allocation and trend analysis
  • Discover customer pain points to address with product innovations
  • Track customer sentiment on social media about your brand and competitors
  • Identify behavioral patterns that can lead to customized interactions with users
  • Refine brand messaging and brand interactions
  • Learn about opportunities to upsell, cross-sell, or re-engage with customers
  • Shine a light on unknown security risks
  • Capture data to improve compliance

Additional resources

For more on this topic, explore the BMC Machine Learning & Big Data Blog.

MongoDB Indexes: Top Index Types & How to Manage Them

MongoDB indexes provide users with an efficient way of querying data. When querying data without indexes, the query will have to search for all the records within a database to find data that match the query.

In MongoDB, querying without indexes is called a collection scan. A collection scan will:

  • Result in various performance bottlenecks
  • Significantly slow down your application

Fortunately, using MongoDB indexes fixes both these issues. By limiting the number of documents to be queried, you’ll improve the overall performance of the application.

In this tutorial, I’ll walk you through different types of indexes and show you how to create, find and manage indexes in MongoDB.

(This article is part of our MongoDB Guide.)

What are indexes in MongoDB?

MongoDB indexes are special data structures that make it faster to query a database. They speed up finding and retrieving data by storing a small part of the dataset in an efficient way — you don’t have to scan every document in a data collection.

MongoDB indexes store the values of the indexed fields outside of the data collection and keep track of their location in the disk. The indexed fields are ordered by the values. That makes it easy to perform equality matches and to make range-based queries efficiently. You can define MongoDB indexes on the collection level, as indexes on any field or subfield in a collection are supported.

You can manage the indexes on your data collections using either the Atlas CLI or the Atlas UI. Either way, well-designed indexes make query execution more efficient.

Why do we need indexes in MongoDB?

Indexes are invaluable in MongoDB. They are an efficient way to organize information in a collection and they speed up queries, returning relevant results more quickly. By using an index to group, sort, and retrieve data, you save considerable time. Your database engine no longer needs to sift through each record to find matches.

What are the disadvantages of indexing?

Indexing does have some drawbacks. Performance on writes is affected by each index you create, and each one takes up disk space. To avoid collection bloat and slow writes, create only indexes that are truly necessary.

How many indexes can you use?

MongoDB indexes are capped at 64 per data collection. In a compound index, you can only have 32 fields. The $text query requires a special text index — you can’t combine it with another query operator requiring a different type of special index.

Working with indexes

For this tutorial, we’ll use the following data set to demonstrate the indexing functionality of MongoDB:

use students
db.createCollection("studentgrades")
db.studentgrades.insertMany(
    [
        {name: "Barry", subject: "Maths", score: 92},
        {name: "Kent", subject: "Physics", score: 87},
        {name: "Harry", subject: "Maths", score: 99, notes: "Exceptional Performance"},
        {name: "Alex", subject: "Literature", score: 78},
        {name: "Tom", subject: "History", score: 65, notes: "Adequate"}
    ]
)
db.studentgrades.find({},{_id:0})

Result

Data set to demonstrate indexing functionality of MongoDB.

Are MongoDB indexes unique?

When creating documents in a collection, MongoDB creates a unique index using the _id field. MongoDB refers to this as the Default _id Index. This default index cannot be dropped from the collection.

When querying the test data set, you can see the _id field which will be utilized as the default index:

db.studentgrades.find().pretty()

Result:

The _id field is the Default _id Index.

How to create an index in MongoDB

To create an index in MongoDB, use the createIndex() method with the following syntax:

db.<collection>.createIndex(<Key and Index Type>, <Options>)

When creating an index, define the field to be indexed and the direction of the key (1 or -1) to indicate ascending or descending order.

For a single-field index, the choice of ascending or descending order rarely matters, because MongoDB can traverse the index in either direction. Direction becomes important in compound indexes that need to support sorts across multiple fields.

Another thing to keep in mind is the index names. By default, MongoDB will generate index names by concatenating the indexed keys with the direction of each key in the index using an underscore as the separator. For example: {name: 1} will be created as name_1.

The best practice is to use the name option to define a custom index name when creating an index. Indexes cannot be renamed after creation. The only way to rename an index is to first drop that index, which we show below, and recreate it using the desired name.

createIndex() example

Let's create an index using the name field in the studentgrades collection and name it student name index.

db.studentgrades.createIndex(
{name: 1},
{name: "student name index"}
)

Result:

Creating an index called student name index.

Finding indexes in MongoDB

You can find all the available indexes in a MongoDB collection by using the getIndexes() method. This will return all the indexes in a specific collection.

db.<collection>.getIndexes()

getIndexes() example

Let’s view all the indexes in the studentgrades collection using the following command:

db.studentgrades.getIndexes()

Result:

getIndexes() example in MongoDB.

The output contains the default _id index and the user-created index student name index.

How to list indexes in MongoDB

You can list indexes on a data collection using Shell or Compass. This command will give you an array of index documents:

db.collection.getIndexes()

An alternative is to use MongoDB Atlas UI. Open a cluster and go to the Collections tab. Select the database and collection, then click on Indexes to see them listed.

Lastly, you can use the following command in MongoDB Atlas CLI to see the indexes:

atlas clusters index list --clusterName <your-cluster> --db <database> --collection <collection>

How to delete indexes in MongoDB

To drop or delete an index from a MongoDB collection, use the dropIndex() method while specifying the index name to be dropped.

db.<collection>.dropIndex(<Index Name / Field Name>)

dropIndex() examples

Let’s remove the user-created index with the index name student name index, as shown below.

db.studentgrades.dropIndex("student name index")

Result:

Example of how to delete a MongoDB index with a name.

You can also use the index field value for removing an index without a defined name:

db.studentgrades.dropIndex({name:1})

Result:

Example of how to delete a MongoDB index without a name.

The dropIndexes command can also drop all the indexes excluding the default _id index.

db.studentgrades.dropIndexes()

Result:

Example of how to delete all MongoDB indexes.

What are the different types of indexes in MongoDB?

The different types of indices in MongoDB.

MongoDB provides different types of indexes that can be utilized according to user needs. Here are the main index types in MongoDB:

  • Single field index
  • Compound index
  • Multikey index

In addition to the popular Index types mentioned above, MongoDB also offers some special index types for targeted use cases:

  • Geospatial index
  • Text index
  • Hashed index

Single field index

These user-defined indexes use a single field in a document to create an index in an ascending or descending sort order (1 or -1). In a single field index, the sort order of the index key does not have an impact because MongoDB can traverse the index in either direction.

Example

db.studentgrades.createIndex({name: 1})

Result:

Creating a single field index in MongoDB.

The above index will sort the data in ascending order using the name field. You can use the sort() method to see how the data will be represented in the index.

db.studentgrades.find({},{_id:0}).sort({name:1})

Result:

Use sort() method to see data.

Compound index

You can use multiple fields in a MongoDB document to create a compound index. This type of index uses the first field for the initial sort and then sorts by each subsequent field within that order.

Example

In the following compound index, MongoDB will:

  • First sort by the subject field
  • Then, within each subject value, sort by score in descending order

db.studentgrades.createIndex({subject: 1, score: -1})

MongoDB compound index example.

The index would create a data structure similar to the following:

db.studentgrades.find({},{_id:0}).sort({subject:1, score:-1})

Result:

Index data structure.

Multikey index

MongoDB supports indexing array fields. When you create an index for a field containing an array, MongoDB will create separate index entries for every element in the array. These multikey indexes enable users to query documents using the elements within the array.

MongoDB automatically creates a multikey index when it encounters an array field; the user does not have to explicitly define the multikey type.

Example

Let’s create a new data set containing an array field to demonstrate the creation of a multikey index in MongoDB.

db.createCollection("studentperformance")
db.studentperformance.insertMany(
[
{name: "Barry", school: "ABC Academy", grades: [85, 75, 90, 99] },
{name: "Kent", school: "FX High School", grades: [74, 66, 45, 67]},
{name: "Alex", school: "XYZ High", grades: [80, 78, 71, 89]},
]
)
db.studentperformance.find({},{_id:0}).pretty()

Result:

Creating a multikey index dataset in MongoDB.

Now let’s create an index using the grades field.

db.studentperformance.createIndex({grades:1})

Result:

Creating a multikey index in MongoDB.

The above code will automatically create a Multikey index in MongoDB. When you query for a document using the array field (grades), MongoDB will search for the first element of the array defined in the find() method and then search for the whole matching query.

For instance, let’s consider the following find query:

db.studentperformance.find({grades: [80, 78, 71, 89]}, {_id: 0})

Initially, MongoDB will use the multikey index for searching documents where the grades array contains the first element (80) in any position. Then, within those selected documents, the documents with all the matching elements will be selected.

Geospatial Index

MongoDB provides two types of indexes to increase the efficiency of database queries when dealing with geospatial coordinate data:

  • 2d indexes that use planar geometry, intended for the legacy coordinate pairs used in MongoDB 2.2 and earlier.
  • 2dsphere indexes that use spherical geometry.

Syntax:

db.<collection>.createIndex( { <location Field> : "2dsphere" } )

Text index

The text index type enables you to search the string content in a collection.

Syntax:

db.<collection>.createIndex( { <Index Field>: "text" } )

Hashed index

The MongoDB hashed index type supports hash-based sharding. It indexes the hash value of the specified field.

Syntax:

db.<collection>.createIndex( { <Index Field> : "hashed" } )

MongoDB index properties

You can enhance the functionality of an index further by utilizing index properties. In this section, you will get to know these commonly used index properties:

  • Sparse index property
  • Partial index property
  • Unique index property

Sparse index property

The MongoDB sparse property tells an index to skip documents that are missing the indexed field, so the resulting index contains only the documents in which that field is present.

Example

db.studentgrades.createIndex({notes:1},{sparse: true})

Result:

Example of sparse index property in MongoDB.

In the previous studentgrades collection, if you create an index using the notes field, it will index only two documents as the notes field is present only in two documents.

Partial index property

The partial index functionality allows users to create indexes that match a certain filter condition. Partial indexes use the partialFilterExpression option to specify the filter condition.

Example

db.studentgrades.createIndex(
{name:1},
{partialFilterExpression: {score: { $gte: 90}}}
)

Result:

Example of partial index property in MongoDB.

The above code will create an index for the name field but will only include documents in which the value of the score field is greater than or equal to 90.

Unique index property

The unique property enables users to create a MongoDB index that only includes unique values. This will:

  • Reject any duplicate values in the indexed field
  • Limit the index to documents containing unique values

Example

db.studentgrades.createIndex({name:1},{unique: true})

Result:

Example of unique index property in MongoDB.

The above-created index will limit the indexing to documents with unique values in the name field.

Indexes recap

That concludes this MongoDB indexes tutorial and guide. You learned how to create, find, and drop indexes, how to use different index types, and how to create more complex indexes. These indexes can enhance the functionality of your MongoDB databases and improve the performance of applications that rely on fast database queries.

Related reading

Real Time vs. Batch Processing vs. Stream Processing

With the constant rate of innovation, developers can expect to analyze terabytes and even petabytes of data in any given period of time. (Data, after all, attracts more data.)

This offers numerous advantages, of course. But what do you do with all this data? It can be difficult to know the best way to accelerate these technologies, especially when reactions must occur quickly.

For digital-first companies, a growing question has become how best to use real-time processing, batch processing, and stream processing. This post will explain the basic differences between these data processing types.

Real time data processing and operating systems

Real-time operating systems are defined by how quickly they react to data. A system can be categorized as real-time if it can guarantee that its reaction will occur within a tight real-world deadline, usually a matter of seconds or milliseconds.

One of the best examples of a real-time system is the kind used in the stock market. If a stock quote should come back from the network within 10 milliseconds of being placed, this would be considered a real-time process. Whether that is achieved by a software architecture that uses stream processing or by processing done in hardware is irrelevant; the guarantee of the tight deadline is what makes it real-time.

Challenges with real-time operating systems

While this type of system sounds like a game changer, the reality is that real-time systems are extremely hard to implement with common software systems. Because the real-time system takes control over program execution, it introduces an entirely new level of abstraction.

What this means is that the distinction between the control-flow of your program and the source code is no longer apparent because the real-time system chooses which task to execute at that moment. This is beneficial, as it allows for higher productivity using higher abstraction and can make it easier to design complex systems, but it means less control overall, which can be difficult to debug and validate.

Another common challenge with real-time operating systems is that their tasks are not isolated entities. The system decides which task to schedule and runs higher-priority tasks before lower-priority ones, delaying the lower-priority work until all the higher-priority tasks are completed.

More and more, some software systems are adopting a flavor of real-time processing in which the deadline is treated less as an absolute and more as a probability. Known as soft real-time systems, they usually or generally meet their deadlines, although performance begins to degrade if too many deadlines are missed.
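
As a rough illustration of the soft real-time idea, the sketch below (in Python, with an invented deadline and workload) measures each reaction against a deadline and reports the miss rate rather than guaranteeing zero misses:

import random
import time

DEADLINE_MS = 10.0   # illustrative deadline for reacting to each event
events = 200
misses = 0

for _ in range(events):
    start = time.perf_counter()
    # Stand-in for the real reaction work; occasionally it runs long.
    time.sleep(0.015 if random.random() < 0.05 else 0.001)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > DEADLINE_MS:
        misses += 1

print(f"Missed {misses}/{events} deadlines ({misses / events:.1%})")
# A soft real-time system tolerates a small miss rate; a hard real-time system cannot.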

Real-time processing use cases

When you are continually inputting and processing data, and handling a steady data output stream, you need real-time processing. Here are some real world situations where real-time processing is necessary.

  • Automated Teller Machines (ATMs): To improve the customer experience, boost back office efficiency and analytics capabilities, and reduce fraud, banks are adopting real-time processing. Processing ATM transactions, immediately posting transactions to the account, and adjusting the balance in real-time requires secure and efficient user credential validation, account balance checks, and on-the-spot transaction authorization.
  • Air traffic control: Safely managing and moving aircraft in a crowded space requires using data from multiple sources, such as radar, satellite imagery, sensor networks, and aircraft communications. Advanced technologies like AI require real-time data to provide up-to-the-second situational awareness, advance warning of potential conflicts, and the ability to optimize flight paths and manage air space. The ideal is being able to forecast proactive decisions that avoid collisions and minimize congestion.
  • Anti-lock braking systems (ABS): The vehicle safety advantage of automated, optimized braking requires real-time data. Information about tire-to-road conditions has to be immediate. The moment wheel slip is detected, the system regulates dynamic braking force. Speed on straightaways, curves, and angles, along with road surface conditions like moisture, ice, and oil, are all data points the system must instantly detect, analyze, and respond to in order to prevent wheel lock-up, maintain vehicle control, and reduce stopping distances in emergency braking situations.

Batch Processing

Batch processing is the processing of a large volume of data all at once. The data can easily consist of millions of records for a given day and can be stored in a variety of ways (files, records, etc.). The jobs typically run non-stop, working through the accumulated data in sequential order.

A go-to example of a batch data processing job is all of the transactions a financial firm might submit over the course of a week. Batch data processing is an extremely efficient way to process large amounts of data that are collected over a period of time. It also helps reduce the operational costs businesses might otherwise spend on labor, since it doesn't require specialized data entry clerks to keep it running. It can be used offline and gives managers complete control over when to start the processing, whether overnight or at the end of a week or pay period.
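
To illustrate the batch model, here is a minimal Python sketch that processes a period's worth of accumulated transactions in a single run. The file name and record format are invented for the example:

import csv
from collections import defaultdict

# Batch job: process all transactions accumulated over the period in one pass.
# "weekly_transactions.csv" is a hypothetical export with columns: account_id, amount
totals = defaultdict(float)

with open("weekly_transactions.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["account_id"]] += float(row["amount"])

# Emit one summary at the end of the run, as a nightly or weekly batch job would.
for account_id, total in sorted(totals.items()):
    print(f"{account_id}: {total:.2f}")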

Challenges with utilizing batch processing

As with anything, there are a few disadvantages to utilizing batch processing software. One of the biggest issues that businesses see is that debugging these systems can be tricky. If you don’t have a dedicated IT team or professional, trying to fix the system when an error occurs could be detrimental, causing the need for an outside consultant to assist.

Another problem with batch processing is that companies usually implement it to save money, yet the software and training require a significant upfront investment. Managers will need to be trained to understand:

  • How to schedule a batch
  • What triggers them
  • What certain notifications mean

(Learn more about modern batch processing.)

Batch data processing use cases

For less time-sensitive jobs, batch processing can be an efficient option. When it doesn’t matter that processing could take hours or even days, you don’t need real-time or near-real-time options.

  • End-of-day reporting: Financial institutions typically run end-of-day reports that include bank ledger data, such as starting balance, deposits, withdrawals, and transfers, culminating in an ending balance for the business day. The report supports accuracy by flagging errors and ensures system integrity. It helps managers improve and maintain operational efficiency and provides an audit trail required by law. Activities occur all day, but the processing of the report and analysis is done at a single time after the bank day ends.
  • Data warehousing: Data in data warehouses is typically managed in scheduled and regular extract, transform, and load (ETL) batch processes. Periodic updates are an efficient way to handle large volumes of data, while ensuring that the data warehouse is kept up-to-date with the latest information, for analytical purposes.
  • Payroll processing: Company payrolls are usually done in a regular cadence, typically bi-weekly or monthly. Using batch processing streamlines and speeds up the work of collecting timekeeping data, calculating salaries, taxes, and other deductions, and then generating paychecks.

Stream Processing

Stream processing is the ability to analyze data almost instantaneously as it streams from one device to another.

This method of continuous computation happens as data flows through the system with no compulsory time limitations on the output. With the almost instant flow, systems do not require large amounts of data to be stored.

Stream data processing is highly beneficial if the events you wish to track are happening frequently and close together in time. It is also best to utilize if the event needs to be detected right away and responded to quickly. Stream processing, then, is useful for tasks like fraud detection and cybersecurity. If transaction data is stream-processed, fraudulent transactions can be identified and stopped before they are even complete.
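
Here is a minimal Python sketch of that streaming model. The event source and fraud threshold are invented for illustration; the point is that each transaction is examined the moment it arrives rather than after a batch has accumulated:

import random
import time
from itertools import islice

FRAUD_THRESHOLD = 4000  # illustrative rule: unusually large amounts get flagged

def transaction_stream():
    """Simulates an endless stream of incoming transactions."""
    while True:
        yield {"account": f"acct-{random.randint(1, 5)}",
               "amount": round(random.uniform(1, 5000), 2)}
        time.sleep(0.05)  # stand-in for network arrival time

# Each event is processed the moment it arrives -- no waiting for a batch.
for event in islice(transaction_stream(), 100):  # bounded here only so the demo ends
    if event["amount"] > FRAUD_THRESHOLD:
        print("Flagged for review:", event)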

Data streaming process

Here are details to illustrate how the data streaming process works.

  • Event-driven: When something relevant happens, the event triggers a function. It could be clicking a link on a website or, in the case of IoT devices, a smart lighting system could sense sunset and the darkening of a room.
  • Data flow: Data is processed as it comes, to support workflows without interruption. Examples include rideshare apps, stock trading platforms, and multiplayer games.
  • Timestamped: This process is helpful when dealing with data that happens in a time series, such as logs, transaction flows, and task flows.
  • Continuous and heterogeneous: Helpful when dealing with diverse kinds of data in different formats from multiple sources that are continuous, meaning that they change within a range, such as time. Outside temperature and wind speed are examples.

Data streaming characteristics

  • Low latency: While not zero, processing time is measurable in seconds, even milliseconds.
  • Scalability: The process supports unexpected and rapid growth in the amount of data handled.
  • Fault tolerance: The approach requires high uptimes with minimal, even zero, risk of data loss.
  • State management: Tracking, storing, and processing state such as sessions, windows, and counts of events within a time window (see the windowing sketch after this list).
  • Distributed processing: This approach ensures workloads are handled by multiple nodes to provide fault tolerance and scalability.
  • Integration with other systems: The ability to connect and exchange data with other platforms, including other databases, message queues, and analytic tools.
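
To illustrate the state management point above, here is a small, self-contained Python sketch that counts events in tumbling time windows. The window length and event timestamps are invented:

from collections import Counter

WINDOW_SECONDS = 60  # tumbling window length (illustrative)

# (timestamp_in_seconds, event_type) pairs as they might arrive on a stream
events = [(3, "click"), (42, "click"), (61, "purchase"), (75, "click"), (130, "click")]

counts_per_window = Counter()
for ts, _event_type in events:
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
    counts_per_window[window_start] += 1  # state is kept per window, not per event

for window_start in sorted(counts_per_window):
    print(f"window [{window_start}, {window_start + WINDOW_SECONDS}): "
          f"{counts_per_window[window_start]} events")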

Challenges with stream processing

One of the biggest challenges organizations face with stream processing is that the system's long-term data output rate must be just as fast as, or faster than, the long-term data input rate; otherwise, the system will begin to have issues with storage and memory.

Another challenge is trying to figure out the best way to cope with the huge amount of data that is being generated and moved. In order to keep the flow of data through the system operating at the highest optimal level, it is necessary for organizations to create a plan for how to reduce the number of copies, how to target compute kernels, and how to utilize the cache hierarchy in the best way possible.

Stream processing use cases

Stream processing is helpful in several core functions.

  • Fraud detection: Track and monitor transactions in real-time, flagging activities and events that are suspicious. By quickly identifying potential fraud, you can take steps to either validate a transaction or possibly identify the fraudster or their broader cybercrime exploit.
  • Network monitoring: It is vital to detect anomalies in network traffic that could indicate either malfunctions or malfeasance. Constant monitoring can quickly detect and address issues.
  • Predictive maintenance: Monitoring equipment and systems in real-time catches issues early instead of waiting for a failure that could lead to costly downtime. This makes maintenance more efficient and operations more cost-effective.
  • Intrusion detection: Monitoring for cybersecurity makes it possible to identify unauthorized access in real-time. Responding quickly can limit the harm done and having data helps your organization comply with regulations. You will also better understand the tactics, techniques, and procedures criminals use, to develop better protections moving forward.

Conclusion

While all of these systems have advantages, at the end of the day organizations should consider the potential benefits of each to decide which method is best suited for the use-case.

Additional Resources

Manage SLAs for Your Batch Services, Joe Goldberg, BMC Software (presentation)

Data Lake vs. Data Warehouse vs. Database: Key Differences Explained

Data storage is a big deal. Data companies are in the news a lot lately, especially as companies attempt to maximize value from big data’s potential. For the lay person, data storage is usually handled in a traditional database. But for big data, companies use data warehouses and data lakes.

Data lakes are often compared to data warehouses—but they shouldn't be. Data lakes and data warehouses are very different, from their structure and processing all the way to who uses them and why. In this article, we'll define each term and compare them across several key areas.

Defining database, data warehouse, and data lake

Let’s start with the concepts, and we’ll use an expert analogy to draw out the differences.

What’s a database?

A database is a storage location that houses structured data. We usually think of a database on a computer—holding data, easily accessible in a number of ways. Arguably, you could consider your smartphone a database on its own, thanks to all the data it stores about you.

For all organizations, the use cases for databases include:

  • Creating reports for financial and other data
  • Analyzing relatively small datasets
  • Automating business processes
  • Auditing data entry

Database examples

When it comes to commercial database systems, you have a variety of choices. Each of the examples below has different features and capabilities to consider when selecting the one that best suits your application.

Popular databases are:

(Learn more about the key difference in databases: SQL vs NoSQL.)

What’s a data warehouse?

The next step up from a database is a data warehouse. Data warehouses are large storage locations for data that you accumulate from a wide range of sources. For decades, the foundation for business intelligence and data discovery/storage rested on data warehouses. Their specific, static structures dictate what data analysis you could perform.

Data warehouses are popular with mid- and large-size businesses as a way of sharing data and content across the team- or department-siloed databases. Data warehouses help organizations become more efficient. Organizations that use data warehouses often do so to guide management decisions—all those “data-driven” decisions you always hear about.

Data warehouse examples

These large-scale data storage solutions make it possible to gather data from multiple sources into a single repository that supports business analytics and decision-making.

Popular companies that offer data warehouses include:

What’s a data lake?

A data lake is a large storage repository that holds a huge amount of raw data in its original format until you need it. Data lakes address the biggest limitation of data warehouses: their lack of flexibility.

As we’ll see below, the use cases for data lakes are generally limited to data science research and testing—so the primary users of data lakes are data scientists and engineers. For a company that actually builds data warehouses, for instance, the data lake is a place to dump and temporarily store all the data until the data warehouse is up and running. Small and medium sized organizations likely have little to no reason to use a data lake.

Data lake examples

These massive reservoirs are flexible enough to contain raw data that may be structured, partially structured, or fully unstructured. As AI, machine learning, and new types of data analytics increasingly handle a variety of data types and structures, data lakes have emerged as the scalable and flexible solution of choice.

Popular data lake companies are:

Data lake vs. data warehouse

Understanding the differences in data storage options will help you make the right choice for your applications. Here is a quick synopsis of what each option offers:

Databases:

  • Are best for real-time processing of transactions.
  • Data is highly structured and normalized into rows and columns of values and attributes.
  • Schemas are rigid and emphasize consistency.
  • Scalability is limited.

Data warehouses:

  • Are best for storing business intelligence and analyzing historical data for operational decision-making.
  • Data comes from multiple sources and is either structured or semi-structured.
  • Schemas are predefined, but can be adapted to a degree to support analysis.
  • Scalability is high, to handle large masses of data efficiently and economically.

Data lakes:

  • Are best for storing massive quantities of data for diverse and yet-to-be defined applications, including machine learning and other advanced uses.
  • Data can be structured, unstructured, and in various formats from tables, to logs, to images and documents.
  • Schemas are applied when data is read and used in applications, otherwise data in storage in the lake has no schema.
  • Scalability and flexibility are highly cost-effective and are the central values of data lakes.

Data Lake vs Data Warehouse vs Database: Illustrating the differences

Lee Easton, president of data-as-a-service provider AeroVision.io, recommends a tool analogy for understanding the differences. In this, your data are the tools you can use.

Imagine a tool shed in your backyard. You store some tools—data—in a toolbox or on (fairly) organized shelves. This specific, accessible, organized tool storage is your database. The tool shed, where all this is stored, is your data warehouse. You might have lots (and lots!) of toolboxes in the shed. Some toolboxes might be yours, but you could store toolboxes belonging to your friends or neighbors, as long as your shed is big enough. Though you're storing their tools, your neighbors still keep them organized in their own toolboxes.

But what if your friends aren’t using toolboxes to store all their tools? They’ve just dumped them in there, unorganized, unclear even what some tools are for—this is your data lake.

In a data lake, the data is raw and unorganized, likely unstructured. Any raw data from the data lake that hasn’t been organized into shelves (databases) or an organized system (data warehouses) is barely even a tool—in raw form, that data isn’t useful.

Comparing data storage

Now that we’ve got the concepts down, let’s look at the differences across databases, data warehouses, and data lakes in six key areas.

Data

Databases and data warehouses can only store data that has been structured. A data lake, on the other hand, does not restrict data the way a data warehouse or a database does. It stores all types of data: structured, semi-structured, or unstructured.

All three data storage locations can handle hot and cold data, but cold data is usually best suited to data lakes, where latency isn't an issue. (More on latency below.)

Processing

Before data can be loaded into a data warehouse, it must have some shape and structure—in other words, a model. The process of giving data some shape and structure is called schema-on-write. A database also uses the schema-on-write approach.

A data lake, on the other hand, accepts data in its raw form. When you do need to use data, you have to give it shape and structure. This is called schema-on-read, a very different way of processing data.
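
A compact way to see the difference is to put the two approaches side by side in code. This sketch uses SQLite for schema-on-write and raw JSON strings for schema-on-read; both are standard-library stand-ins chosen for illustration, not a statement about any particular warehouse or lake product:

import json
import sqlite3

# Schema-on-write: the structure is fixed before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES (?, ?)", ("EMEA", 1200.0))
print(conn.execute("SELECT SUM(amount) FROM sales WHERE region = 'EMEA'").fetchone())

# Schema-on-read: raw records are stored as-is, and fields may vary record to record...
raw_records = [
    '{"region": "EMEA", "amount": 1200.0, "rep": "A. Jones"}',
    '{"region": "APAC", "amount": 800.0}',
]

# ...structure is imposed only at read time, when a question is asked.
emea_total = 0.0
for raw in raw_records:
    record = json.loads(raw)              # the shape is decided here, on read
    if record.get("region") == "EMEA":
        emea_total += record.get("amount", 0.0)
print("EMEA total (schema applied on read):", emea_total)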

Cost

One of the most attractive features of big data technologies is the cost of storing data. Storing data with big data technologies is relatively cheaper than storing it in a data warehouse. That's because big data technologies are often open source, so licensing and community support are free, and they are designed to run on low-cost commodity hardware.

Running a data warehouse can be costly, especially if the volume of data is large. A data lake, on the other hand, is designed for low-cost storage. A database has flexible storage costs, which can be high or low depending on the needs.

Agility

A data warehouse is a highly structured data bank, with a fixed configuration and little agility. Changing the structure isn’t too difficult, at least technically, but doing so is time consuming when you account for all the business processes that are already tied to the warehouse.

Likewise, databases are less agile to configure because of their structured nature.

Conversely, a data lake lacks structure. This agility makes it easy for data developers and data scientists to easily configure and reconfigure data models, queries, and applications. (That explains why data experts primarily—not lay employees—are working in data lakes: for research and testing. The lack of structure keeps non-experts away.)

Security

Data warehouse technologies, unlike big data technologies, have been around and in use for decades. Data warehouses are much more mature and secure than data lakes.

Big data technologies, which incorporate data lakes, are relatively new. Because of this, the ability to secure data in a data lake is immature. Surprisingly, databases are often less secure than warehouses. That's likely because databases were developed for small sets of data—not the big data use cases we see today. Luckily, data security is maturing rapidly.

Users

Data warehouses, data lakes, and databases are suited for different users:

  • Databases are very flexible and thus suited for any user.
  • Data warehouses are used mostly in the business industry by business professionals.
  • Data lakes are mostly used in scientific fields by data scientists.

Caution on data lakes

Companies are adopting data lakes, sometimes instead of data warehouses. But data lakes are not free of drawbacks and shortcomings. New technology often comes with challenges—some predictable, others not. Data lakes are no different. It isn’t that data lakes are prone to errors. Instead, companies venturing into data lakes should do so with caution.

Data lake disadvantages

Data lakes won’t solve all your data problems. In fact, they may add fuel to the fire, creating more problems than they were meant to solve. That’s because data lakes tend to overlook data best practices.

  • Data lakes allow you to store anything without questioning whether you need all the data. This approach is faulty because it makes it difficult for a data lake user to get value from the data.
  • Data lakes do not prioritize which data is going into a supply chain and how that data is beneficial. This lack of data prioritization increases the cost of data lakes (versus data warehouses and databases) and muddies any clarity around what data is required. This slows, perhaps halts, your entire analytical process. Avoid this issue by summarizing and acting upon data before storing it in data lakes.
  • Data latency is higher in data lakes. Data lakes are often used for reporting and analytics; any lag in obtaining data will affect your analysis. Latency in data slows interactive responses, and by extension, the clock speed of your organization. Your reason for that data, and the speed to access it, should determine whether data is better stored in a data warehouse or database.
  • Data lakes do not have rules overseeing what they can take in, increasing your organizational risk. The fact that you can store all your data, regardless of the data’s origins, exposes you to a host of regulatory risks. Multiply this across all users of the data lake within your organization. The lack of data prioritization further compounds your compliance risk.
  • Data lakes foster data overindulgence. Too much unprioritized data creates complexity, which means more costs and confusion for your company—and likely little value. Organizations should not strive for data lakes on their own; instead, data lakes should be used only within an encompassing data strategy that aligns with actionable solutions.

Data is only valuable if it can be utilized to help make decisions in a timely manner. A user or a company planning to analyze data stored in a data lake will spend a lot of time finding it and preparing it for analytics—the exact opposite of data efficiency for data-driven operations.

Instead, you should always view data from a supply chain perspective: beginning, middle, and end. No matter the data, you should always plan a strategy for how you will:

  • Find the data
  • Bring data into organizational data storage
  • Explore and transform the data

Such an approach allows optimization of value to be extracted from data.

The future is with data warehouses

If data warehouses have been neglected for data lakes, they might be making a comeback. That’s for two main reasons, according to Mark Cusack, CTO of Yellowbrick:

  • Data warehouse companies are improving the consumer cloud experience, making it easier to try, buy, and expand your warehouse with little to no administrative overhead.
  • Data warehousing will become crucial in machine learning and AI. That’s because ML’s potential relies on up-to-the-minute data, so that data is best stored in warehouses—not lakes.

When developing machine learning models, you'll spend approximately 80% of your time just preparing the data. Warehouses have built-in transformation capabilities, making this data preparation easy and quick to execute, especially at big data scale. And these warehouses can reuse features and functions across analytics projects, which means you can overlay a schema across different features. This reduces duplication and increases your data quality.

As companies embrace machine learning and data science, data warehouses will become the most valuable tool in your data tool shed.

Which data storage solution is right for me?

When it comes to databases vs. data warehouses vs. data lakes, the right choice depends on your use case and the needs of your organization. Some organizations use all three approaches to support various data usage. The hybrid approach balances flexibility, scale, and performance to cost-effectively support a growing number of use cases.

  • If your organization needs to process transactions efficiently in real-time, databases are ideal.
  • If you need to support business intelligence analytics and reporting from large quantities of structured and semi-structured data, a data warehouse provides power and performance.
  • If you anticipate using large volumes of diverse, raw data and applying advanced analytics, machine learning, or big data processing, you are best served by a data lake.

BMC for data solutions

BMC’s award-winning Control-M is an industry standard for enterprise automation and orchestration. And our brand-new SaaS solution BMC Helix Control-M gives you the same organization, control, and orchestration—in the cloud.

Additional resources

For more on this topic, explore these resources:

 

Database ACID Properties: Atomic, Consistent, Isolated, Durable

Understanding the foundation of reliable and consistent database transactions

I don't think it's an overstatement to say that data is pretty important. Data is especially important for modern organizations. In fact, back in 2017, The Economist went so far as to say that data had surpassed oil as the world's most valuable resource.

One of the problems with data, though, is the massive amount that needs to be processed daily. There's so much data being generated across the globe these days that we had to come up with a new term just to express how much there is: big data. Sure, it's not the most impressive-sounding term out there, but the fact remains.

With all this big data out there, organizations seek ways to improve how they manage it from a practical, computational, and security standpoint. Like Spiderman’s Uncle Ben once said:

“With great [data] comes great responsibility.”

The best method the IT world has created for navigating the complexities of data management is using databases.

What is a database?

Databases are structured sets of data that are stored within computers. Oftentimes, databases are stored on entire server farms filled with computers that were made specifically for the purpose of handling that data and the processes necessary for making use of it.

Modern databases are such complex systems that management systems have been designed to handle them. These database management systems (DBMS) seek to optimize and manage the storage and retrieval of data within databases.

The ACID approach is one of the guiding stars leading organizations to successful database management.

What are ACID transactions?

In the context of computer science and databases, ACID stands for:

  • Atomicity
  • Consistency
  • Isolation
  • Durability

ACID is a set of guiding principles that ensure database transactions are processed reliably. A database transaction is any operation performed within a database, such as creating a new record or updating data within one.

ACID transactions are operations made within a database that need to be performed with care to ensure the data doesn’t become corrupted. Applying the ACID properties to each database modification is the best way to maintain its accuracy and reliability.

Let’s look at each component to understand the full meaning of ACID in database management.

ACID data

Atomicity

In the context of ACID properties of a database, atomicity means that you either:

  • Commit to the entirety of the transaction occurring
  • Have no transaction at all

Essentially, an atomic transaction ensures that any commit you make finishes the entire operation successfully. In case of a lost connection in the middle of an operation, the database is rolled back to its state prior to the commit being initiated.

This is important for preventing crashes or outages from leaving a transaction partially finished in an unknown overall state. If a crash occurs during a transaction with no atomicity, you can’t know exactly how far along the process was before it was interrupted. Using the atomic transaction principle, you ensure that either the entire transaction completes successfully or none of it does.
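
To make this concrete, below is a minimal sketch of an atomic funds transfer using Python’s built-in sqlite3 module; the table, account names, and amounts are hypothetical.

import sqlite3

# Hypothetical accounts table; the two updates below form one transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()      # commits the whole transfer as one unit
except sqlite3.Error:
    conn.rollback()    # any failure returns the database to its prior state

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 50), ('bob', 50)]

If the process crashes before conn.commit() returns, neither update is applied, which is exactly the all-or-nothing behavior described above.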

Consistency

In ACID database management, consistency refers to maintaining data integrity constraints.

A consistent transaction will not violate integrity constraints placed on the data by the database rules. Enforcing consistency ensures that if a database enters into an illegal state (if a violation of data integrity constraints occurs) the process will be aborted and changes rolled back to their previous legal state.

Another way of ensuring consistency within a database throughout each transaction is by also enforcing declarative constraints placed on the database.

An example of a declarative constraint might be that all customer accounts must have a positive balance. If a transaction would bring a customer account into a negative balance, that transaction would be rolled back. This ensures changes are successful at maintaining data integrity or they are canceled completely.
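
As a brief illustration, the sketch below declares that same “positive balance” rule as a CHECK constraint, again using Python’s built-in sqlite3 module; the schema and values are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
# Declarative constraint: balances may never go negative.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 20)")
conn.commit()

try:
    # Withdrawing 50 would violate the constraint...
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # ...so the change is rolled back to the last legal state

print(conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone())
# (20,)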

Isolation

Isolated transactions are considered to be “serializable”, meaning each ACID transaction happens in a distinct order without any transactions occurring in tandem.

Any reads or writes performed on the database will not be impacted by other reads and writes of separate transactions occurring on the same database. A global order is created, with each transaction queueing up in line to ensure the transactions complete in their entirety before another one begins.

Importantly, this doesn’t mean two operations can’t happen at the same time. Multiple transactions can occur as long as those transactions have no possibility of impacting the other transactions occurring at the same time.

Doing this can have impacts on the speed of transactions as it may force many operations to wait before they can initiate. However, this tradeoff is worth the added data security provided by isolation.

In an ACID database, isolation can be accomplished through the use of a sliding scale of permissiveness that goes between what are called optimistic transactions and pessimistic transactions:

  • An optimistic transaction schema assumes conflicts are rare: transactions proceed without locking resources, and if two transactions do end up reading or writing the same place, the conflicting transactions are aborted and retried (see the sketch below).
  • A pessimistic transaction schema provides less liberty and locks down resources up front, assuming that transactions will impact each other. This results in fewer aborts and retries, but it also means transactions must wait in line for their turn more often than with the optimistic approach.

Finding a sweet spot between these two ideals is often where you’ll find the best overall result.
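
The sketch below shows one common way to implement the optimistic end of that scale in application code: a version column lets a writer detect whether another transaction touched the row in the meantime. It uses Python’s built-in sqlite3 module, and the schema and retry limit are hypothetical.

import sqlite3

def optimistic_update(conn, name, delta, retries=3):
    """Apply a balance change only if no concurrent writer changed the row."""
    for _ in range(retries):
        balance, version = conn.execute(
            "SELECT balance, version FROM accounts WHERE name = ?", (name,)
        ).fetchone()
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE name = ? AND version = ?",
            (balance + delta, name, version),
        )
        if cur.rowcount == 1:   # nobody else modified the row; keep the change
            conn.commit()
            return True
        conn.rollback()         # conflict detected: abort and retry
    return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100, 1)")
conn.commit()
print(optimistic_update(conn, "alice", -25))  # True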

Durability

The final aspect of the ACID approach to database management is durability.

Durability ensures that changes made to the database (transactions) that are successfully committed will survive permanently, even in the case of system failures. This ensures that the data within the database will not be corrupted by:

  • Service outages
  • Crashes
  • Other cases of failure

Durability is achieved through the use of changelogs that are referenced when databases (or portions of the database) are restarted.
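
As a rough analogue, the sketch below enables write-ahead logging in Python’s built-in sqlite3 module, so committed writes are recorded in a durable log before the main data file is updated; the file name is hypothetical.

import sqlite3

conn = sqlite3.connect("example.db")          # hypothetical database file
conn.execute("PRAGMA journal_mode=WAL")       # committed writes go to a write-ahead log first
conn.execute("PRAGMA synchronous=FULL")       # sync to disk on commit for crash safety
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, note TEXT)")
conn.execute("INSERT INTO events (note) VALUES ('committed and durable')")
conn.commit()   # once this returns, the write survives a crash or restart
conn.close()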

ACID supports data integrity & security

When every aspect of the ACID approach is brought together successfully, databases are maintained with the utmost data integrity and data security  to ensure that they continuously provide value to the organization. A database with corrupted data can present costly issues due to the huge emphasis that organizations place on their data for both day-to-day operations as well as strategic analysis.

Using ACID properties with your database will ensure your database continues to deliver valuable data throughout operations.

ACID properties in practice

ACID in relational database

Using ACID properties in real-world applications such as relational databases is crucial for maintaining data integrity.

SQL Server fully complies with ACID principles, making it ideal when you need strong data integrity guarantees. With MySQL, ACID compliance depends on the storage engine: the default InnoDB engine is ACID-compliant, while older engines such as MyISAM are not.
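
If you are unsure which engine your MySQL tables use, a quick check like the sketch below can confirm it; this assumes the mysql-connector-python package and uses hypothetical connection details and a hypothetical database name.

import mysql.connector  # assumes the mysql-connector-python package is installed

# Hypothetical connection details; replace with your own.
conn = mysql.connector.connect(host="localhost", user="appuser",
                               password="secret", database="appdb")
cur = conn.cursor()
cur.execute(
    "SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA = %s",
    ("appdb",),
)
for table_name, engine in cur:
    # InnoDB supports ACID transactions; MyISAM does not.
    print(f"{table_name}: {engine}")
conn.close()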

ACID in noSQL database

MongoDB, a document-oriented NoSQL database, was not originally designed around ACID properties beyond single-document atomicity. However, MongoDB 4.0 and later support multi-document ACID transactions, bringing it much closer to relational databases such as SQL Server in terms of data integrity.
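
For example, a multi-document transaction in MongoDB might look like the sketch below, written with PyMongo; the connection string, database, and collection names are hypothetical, and transactions require a replica set or sharded cluster rather than a standalone server.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # hypothetical URI
accounts = client.bank.accounts

with client.start_session() as session:
    with session.start_transaction():
        # Both updates commit together or not at all.
        accounts.update_one({"name": "alice"}, {"$inc": {"balance": -50}}, session=session)
        accounts.update_one({"name": "bob"}, {"$inc": {"balance": 50}}, session=session)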


Best practices for using ACID transactions

To effectively use ACID guiding principles, follow these best practices:

  • Model data to store related data together. You will reduce the risk of inconsistencies, promote scalability, and support more efficient access and updates.
  • Break long-running transactions into smaller pieces. Shorter transactions require fewer resources.
  • Limit transactions to 1,000 document modifications. This allows the system to handle large data volumes while maintaining stability and performance.
  • Configure appropriate read and write concerns to better balance data consistency, system performance, and availability.
  • Handle errors and retry transactions that fail due to transient errors (see the sketch after this list). You will reduce the number of failed operations and ensure that more ACID transactions fully complete to preserve data consistency.
  • Be aware of performance costs for transactions affecting multiple shards. To reduce latency and potential failures, you may wish to localize related transactions on one shard.
  • Enforce data integrity using constraints that prevent invalid data entry and validate data before it gets to the database.
  • Use logging and backups to prevent data loss: schedule frequent backups and test them thoroughly.
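
As an illustration of the retry practice above, here is a sketch using PyMongo; run_transfer is a hypothetical callback containing the transaction body, and client is an existing MongoClient.

from pymongo.errors import PyMongoError

def run_with_retry(client, run_transfer, max_attempts=3):
    """Run a transaction body, retrying when MongoDB flags the error as transient."""
    for attempt in range(max_attempts):
        with client.start_session() as session:
            try:
                with session.start_transaction():
                    run_transfer(session)   # hypothetical callback doing the writes
                return True                 # committed successfully
            except PyMongoError as exc:
                # Transient errors (e.g., a primary election) are safe to retry.
                if exc.has_error_label("TransientTransactionError") and attempt < max_attempts - 1:
                    continue
                raise
    return False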

Related reading

]]>
How To Use mongodump for MongoDB Backups https://www.bmc.com/blogs/mongodb-mongodump/ Fri, 14 Feb 2025 00:00:01 +0000 https://www.bmc.com/blogs/?p=19924 Maintaining backups is vital for every organization. Data backups act as a safety measure where data can be recovered or restored in case of an emergency. Typically, you create database backups by replicating the database, using either: Built-in tools Specialized external backup services Backing Up MongoDB MongoDB offers multiple inbuilt backup options depending on the […]]]>

Maintaining backups is vital for every organization. Data backups act as a safety measure where data can be recovered or restored in case of an emergency. Typically, you create database backups by replicating the database, using either:

  • Built-in tools
  • Specialized external backup services

Backing Up MongoDB

MongoDB offers multiple inbuilt backup options depending on the MongoDB deployment method you use. We’ll look briefly at the options, but then we’ll show you how to utilize one particular option—MongoDB mongodump—for the backup process.

(This article is part of our MongoDB Guide. Use the right-hand menu to navigate.)

Built-in backups in MongoDB

Here are the several options you have for backing up your data in MongoDB:

Mongodump and other MongoDB backup options.

 

MongoDB Atlas Backups

If MongoDB is hosted in the MongoDB Atlas cloud database service, Atlas provides automated, continuous, incremental backups.

Additionally, Atlas can be used to create cloud provider snapshots, where local database snapshots are created using the underlying cloud provider’s snapshot functionality.

MongoDB Cloud Manager or Ops Manager

Cloud Manager is a hosted backup, monitoring, and automation service for MongoDB. Cloud Manager enables easy backup and restore functionality while providing offsite backups.

Ops Manager provides the same functionality as Cloud Manager, but it can be deployed as an on-premises solution.

MongoDB mongodump

Mongodump is a simple MongoDB backup utility that creates high-fidelity BSON files from an underlying database. These files can be restored using the mongorestore utility.

Mongodump is an ideal backup solution for small MongoDB instances due to its ease of use and portability.

File system backups

In this method, you simply keep copies of the underlying data files of a MongoDB installation. You can utilize snapshots if the file system supports them.

Another way is to use a tool like rsync to copy the data files directly to a backup directory.

What is MongoDB mongodump?

Mongodump is a utility for creating database backups. It creates a binary export of the database contents. Mongodump can export data from both mongod and mongos instances, allowing you to create backups from:

  • A standalone instance or a replica set
  • A sharded cluster of MongoDB deployments

Before MongoDB 4.4, mongodump was released alongside the MongoDB server and used matched versioning. The new iterations of mongodump are released as a separate utility in MongoDB Database Tools. Mongodump guarantees compatibility with MongoDB 4.4, 4.2, 4.0, and 3.6.

The mongodump utility is supported on most x86_64 platforms and on some ARM64, PPC64LE, and s390x platforms. You can find the full list of compatible platforms in the official documentation.

Mongodump actions & limitations within MongoDB

The following list breaks down the expected behaviors and limitations of the mongodump utility.

  • The mongodump utility directs its read operations to the primary member of a replica set, making the default read preference to primary.
  • The backup operation excludes the “local” database and captures only the documents, not the index data. Indexes must be rebuilt after a restoration process.
  • When it comes to backing up read-only views, mongodump only captures the views’ metadata. If you want to capture the documents within a view, use the “--viewsAsCollections” flag.
  • To ensure maximum compatibility, use Extended JSON v2.0 (Canonical) for mongodump metadata files. It is recommended to use the corresponding versions of mongodump and mongorestore in backup and restore operations.
  • The mongodump command will overwrite the existing files within the given backup folder. The default location for backups is the dump/ folder.
  • When the WiredTiger storage engine is used in a MongoDB instance, the output will be uncompressed data.
  • Backup operations using mongodump depend on the available system memory. If the data set is larger than the system memory, the mongodump utility will push the working set out of memory.
  • If access control is configured on the MongoDB database, users must have sufficient privileges for each database they back up. MongoDB has a built-in “backup” role with the privileges required to back up any database.
  • MongoDB allows mongodump to be a part of the backup strategy for standalone instances or replica sets.
  • Starting with MongoDB 4.2, mongodump cannot be used as a part of the backup strategy when backing up sharded clusters that have sharded transactions in progress. In these instances, it is recommended to use a solution like MongoDB Cloud Manager or Ops Manager, which maintain the atomicity of transactions across shards.
  • The mongodump command must be executed from the system command shell as it is a separate utility.
  • There is no option for incremental backups. All backups will make a full copy of the database.

MongoDB Database Tools

MongoDB Database Tools are a collection of command-line utilities that help with the maintenance and administration of a MongoDB instance. The MongoDB Database tools are compatible in these environments:

  • Windows
  • Linux
  • macOS

In this section, we will take a look at how we can install the Database Tools on a Linux server.

Checking for Database Tools

To check if the database tools are already installed on the system, we can use the following command.

sudo dpkg -l mongodb-database-tools

Result for Database Tools installed:

Result for Database Tools unavailable:

Installing Database Tools

If your system doesn’t have Database Tools, here’s how to install it.

The MongoDB download center provides the latest version of MongoDB Database Tools. Download the latest version according to your platform and package type. In a CLI environment, we can copy the download link and use wget or curl to download the package.

In the example below, we will be using the Database Tools version 100.2.1 for Ubuntu as a deb package and then install using the downloaded file.

curl -o mongodb-database-tools-ubuntu2004-x86_64-100.2.1.deb https://fastdl.mongodb.org/tools/db/mongodb-database-tools-ubuntu2004-x86_64-100.2.1.deb
sudo apt install ./mongodb-database-tools-ubuntu2004-x86_64-100.2.1.deb

Result:

Using MongoDB mongodump

In this section, we will cover the basic usage of mongodump utility in a standalone MongoDB instance.

Basic mongodump Syntax

mongodump <options> <connection-string>

The most basic method to create a backup is to use the mongodump command without any options. This assumes the database is located on localhost (127.0.0.1), uses port 27017, and has no authentication requirements. The backup process will create a dump folder in the current directory.

mongodump

Result:

We can navigate to the dump folder to verify the created backups.

Backing up a remote MongoDB instance

We can specify a host and a port using the --uri connection string.

Connect using the uri option:

mongodump --uri="mongodb://<host URL/IP>:<Port>" [additional options]

Connect using the host option:

mongodump --host="<host URL/IP>:<Port>"  [additional options]

Connect using host and port options:

mongodump --host="<host URL/IP>" --port=<Port> [additional options]

The following example demonstrates how to create a backup of the remote MongoDB instance:

mongodump --host="10.10.10.59" --port=27017

Result:

Backing up a secure MongoDB instance

If we want to connect to a MongoDB instance with access-control, we need to provide:

  • Username
  • Password
  • Authentication database options

Authentication Syntax

mongodump --authenticationDatabase=<Database> -u=<Username> -p=<Password> [additional options]

Let’s see how we can connect to a remote MongoDB instance using a username and password.

mongodump --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword"

Result:

Selecting databases & collections

Using the --db and --collection options, we can indicate a database and a collection to be backed up. The --db option can be used on its own, but to select a collection, a database must be specified. To exclude a collection from the backup process, we can use the --excludeCollection option.

Selecting a database:

mongodump  --db=<Backup Target - Database> [additional options]

Selecting a collection:

mongodump  --db=<Backup Target - Database> --collection=<Collection Name> [additional options]

Excluding a collection:

mongodump  --db=<Backup Target - Database> --excludeCollection=<Collection Name> [additional options]

In the following example, we define the “vehicleinformation” collection as the only backup target.

mongodump --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" --db=vehicles --collection=vehicleinformation

Result:

Changing the backup directory

The --out option can be used to specify the location of the backup folder.

mongodump --out=<Directory Location> [additional options]

Let us change the backup directory to the “dbbackup” folder.

mongodump --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" --out=dbbackup

Result:

Creating an archive file

The mongodump utility allows us to create an archive file. The --archive option can be used to specify the file. If no file is specified, the output will be written to standard output (stdout).

The --archive option cannot be used in conjunction with the --out option.

mongodump --archive=<file> [additional options]

The below example demonstrates how we can define an archive file.

mongodump --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" --archive=db.archive

Result:

Compressing the backup

The backup files can be compressed using the --gzip option. This option will compress the individual JSON and BSON files.

mongodump --gzip [additional options]

Let’s compress the complete MongoDB database.

mongodump --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" --gzip

Result:

In this article, we covered the essential steps for using MongoDB’s mongodump to create and manage database backups. By following the instructions provided, you now have the tools to efficiently handle your mongodump database backup needs.

]]>
Mongorestore Examples for Restoring MongoDB Backups https://www.bmc.com/blogs/mongodb-mongorestore/ Wed, 12 Feb 2025 00:00:33 +0000 https://www.bmc.com/blogs/?p=20039 It is essential to have an efficient and reliable data restoration method after backing up data during the backup and restore process. Consider the differences: A properly configured restoration method means users can successfully restore the data to the previous state. A poor restoration method makes the whole backup process ineffective, by preventing users from […]]]>

It is essential to have an efficient and reliable data restoration method after backing up data during the backup and restore process. Consider the differences:

  • A properly configured restoration method means users can successfully restore the data to the previous state.
  • A poor restoration method makes the whole backup process ineffective, by preventing users from accessing and restoring the backed-up data.

The mongorestore command is the sister command of the mongodump command. You can restore the dumps (backups) created by the mongodump command into a MongoDB instance using the mongorestore command.

In this article, you will learn how to utilize the mongorestore command to restore MongoDB backups effectively.

(This article is part of our MongoDB Guide. Use the right-hand menu to navigate.)

What is mongorestore?

mongorestore is a simple utility that is used to restore backups. It can load data from either:

  • A database dump file created by the mongodump command
  • The standard input to a mongod or mongos instance

Starting with MongoDB 4.4, the mongorestore utility is not included in the base MongoDB server installation package. Instead, it’s distributed as a separate package within the MongoDB Database Tools package. This allows the utility to have a separate versioning scheme starting with 100.0.0.

The mongorestore utility offers support for MongoDB versions 4.4, 4.2, 4.0, and 3.6. It may also work with earlier versions of MongoDB, but compatibility is not guaranteed.

Additionally, mongorestore supports a multitude of platforms and operating systems ranging from x86 to s390x; you can see the full compatibility list in the official documentation.

Mongorestore behavior

Here is a list of things you need to know about the behavior of the mongorestore utility.

  • The mongorestore utility enables users to restore data to an existing database or create a new database. When restoring data into an existing database, mongorestore will only use insert commands and does not perform any kind of updates. Because of that, existing documents with a matching value for the _id field of the documents in the backup will not be overwritten by the restoration process.

This will lead to a duplicate key error during the restoration process, as shown here:

Mongorestore behavior - duplicate key errors

  • As mongodump does not back up index data, the mongorestore command will recreate indexes from the index definitions recorded by mongodump.
  • The best practice when backing up and restoring a database is to use the corresponding versions of both mongodump and mongorestore. If a backup is created using a specific version of the mongodump utility, it is advisable to use its corresponding version of the mongorestore utility to perform the restore operation.
  • Mongorestore will not restore the “system.profile” collection data.
  • The mongorestore utility supports FIPS (Federal Information Processing Standard) compliant connections for performing restore operations.
  • When restoring data to a MongoDB instance with access control, you need to be aware of the following scenarios:
    • If the data set being restored does not include the “system.profile” collection and you run the mongorestore command without the --oplogReplay option, the “restore” role will provide the necessary permissions to carry out the restoration process.
    • When restoring backups that include the “system.profile” collection, even though mongorestore will not restore that collection, it will try to create a fresh “system.profile” collection. In that case, you can use the “dbAdmin” and “dbAdminAnyDatabase” roles to provide the necessary permissions.
    • To run the mongorestore command with the --oplogReplay option, the user needs a new user-defined role with anyAction on anyResource permissions.
  • You can’t use the mongorestore utility in a backup strategy when backing up and restoring sharded clusters with sharded transactions in progress. This change was introduced in MongoDB 4.2. When dealing with sharded clusters, the recommended solutions would be MongoDB Cloud Manager or Ops Manager, as they can maintain the atomicity of the transactions across shards.
  • The mongorestore command should be executed from the command shell of the system because it is a separate database utility.

Using MongoDB mongorestore

In this section, you will find out the basic usage of the mongorestore utility in a standalone MongoDB instance.

Basic mongorestore syntax

mongorestore <options> <connection-string> <directory or file to restore>

The basic way to restore a database is to use the mongorestore command and specify the backup directory (dump directory) without any options. This is suitable for databases located on localhost (127.0.0.1) using port 27017. The restore process will create new databases and collections as needed and will also log its progress.

mongorestore ./dump/

Result:

mongorestore syntax

In the above example, you can see how to successfully restore the “vehicles” database with all its collections and documents. This will create a new database named vehicles in the MongoDB instance containing all the collections and documents of the backup. You can verify the restoration process by logging into the MongoDB instance.

use vehicles
show collections
db.vehicleinformation.find()

Result:

verify mongorestore backup

Restoring data into a remote MongoDB instance using mongorestore

In order to restore data into a remote MongoDB instance, you need to establish the connection. The connection to a database can be specified using either:

  • The URI connection string
  • The host option
  • The host and port option

Connecting using the URI option:

mongorestore [additional options] --uri="mongodb://<host URL/IP>:<Port>" [restore directory/file]

Connecting using the host option:

mongorestore [additional options] --host="<host URL/IP>:<Port>"  [restore directory/file]

Connecting using host and port options:

mongorestore [additional options] --host="<host URL/IP>" --port=<Port>  [restore directory/file]

The mongorestore example below shows you how to restore a backup to a remote MongoDB instance. The --verbose option will provide users with a detailed breakdown of the restoration process.

mongorestore --verbose --host="10.10.10.59" --port=27017 ./dump/

Result:

mongorestore example of a remote MongoDB instance

Restoring a secure MongoDB instance

When connecting to an access-controlled MongoDB instance, you need to provide:

  • Username
  • Password
  • Authentication database options

Additionally, mongorestore supports key-based authentication. Ensure that the authenticated user has the permissions/roles required to carry out the restoration process.

Authentication syntax:

mongorestore [additional options] --authenticationDatabase=<Database> -u=<Username> -p=<Password> [restore directory/file]

The following restoration command shows how to connect to a remote MongoDB server using the username and password for authentication.

mongorestore --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" ./dump/

Result:

mongorestore command for a remote MongoDB instance

Selecting Databases and Collections

Using the --nsInclude option, users can specify which database or collection needs to be restored. When using the --nsInclude option, you can use a namespace pattern (e.g., “vehicles.*”, “vehicles.vehicleInformation”) to define which database or collection should be included.

To specify multiple namespaces, you can use the --nsInclude option multiple times in a single command. The --nsInclude option also supports wildcards within the defined namespace.

The --db and --collection options are deprecated and will result in the following error.

mongorestore command for multiple namespaces

To exclude a database or a collection, you can use the --nsExclude option.

Selecting a Database/Collection:

mongorestore [additional options] --nsInclude=<namespace> (${DATABASE}.${COLLECTION}) [restore directory/file]

Excluding a Database/Collection:

mongorestore [additional options] --nsExclude=<namespace> (${DATABASE}.${COLLECTION}) [restore directory/file]

In the following example, you will see how to restore the complete “persons” database. You can include the whole database by specifying the “persons” namespace with the asterisk as a wild card pattern. This will restore all the data within the database.

mongorestore --nsInclude=persons.* --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" ./dump/

Result:

Use mongorestore to restore data from an Archive File

The mongorestore utility supports restorations from an archive file. The --archive option can be used to select the archive file, and the --nsInclude and --nsExclude options can be used in conjunction with the archive option.

mongorestore [additional options] --archive=<file>

The below example illustrates how to define an archive file when restoring data. The --nsInclude option is used to specify which collection is to be restored to the database from the archive file.

mongorestore -v --nsInclude=vehicles.vehicleinformation --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" --archive=db.archive

Result:

define an archive file when restoring data with mongorestore

Restoring data from a Compressed File

The mongodump utility uses the --gzip option to compress the individual JSON and BSON files. These compressed backups can also be used to restore the database. The compressed file can also be filtered using the --nsInclude and --nsExclude options.

mongorestore --gzip [additional options] [restore directory/file]

You can restore a compressed MongoDB backup using the following commands. The compressed backup is stored in the “backupzip” directory.

mongorestore --gzip -v --nsInclude=vehicles.vehicleinformation --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" ./backupzip/

Result:

restore a compressed MongoDB backup with mongorestore

The same process can be applied to a compressed archive file. The below mongorestore example shows how to restore data from a compressed archive file.

mongorestore --gzip -v --nsInclude=vehicles.vehicleinformation --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" --archive=db.archive

Result:

mongorestore example of how to restore data from a compressed archive

Mongorestore command for restoring data from standard input

The mongorestore command enables users to read data from standard input and use that data in the restoration process. You can read the data by providing the --archive option without a filename.

mongodump [additional options] --archive | mongorestore [additional options] --archive

The following example shows how to create a backup from a secure MongoDB database using mongodump and pass it as standard input to the mongorestore command to be restored into an unsecured remote MongoDB instance.

mongodump --host=10.10.10.59 --port=27017 --authenticationDatabase="admin" -u="barryadmin" -p="testpassword" --db=vehicles --archive | mongorestore --host=10.10.10.58 --port=27018 --archive

Result:

create a backup from a secure MongoDB database using mongodump

You can verify the restoration process by checking the databases on the remote server. This can be done by executing JavaScript code with the --eval flag. Using the “listDatabases” admin command, you will be able to list all the databases within the remote MongoDB instance.

mongo --host=10.10.10.58 --port=27018 --quiet --eval 'printjson(db.adminCommand( { listDatabases: 1 } ))'

Result:

mongorestore & mongodump

This tutorial offers in-depth knowledge about the MongoDB mongorestore utility and how it can be used to restore backups created by mongodump. Mongorestore offers a convenient and efficient way of restoring database backups.

The combination of mongodump and mongorestore gives administrators of small-scale databases a complete backup strategy.

Related reading

]]>
AI Augmentation: Harnessing the Power of Human-AI Collaboration https://www.bmc.com/blogs/ai-human-augmentation/ Tue, 04 Feb 2025 00:00:31 +0000 https://www.bmc.com/blogs/?p=50484 Many predictions to the outcome of humans and artificial intelligence take an either/or approach. Skynet is determined to end the human race in The Terminator. In The Matrix, the machines have learned to farm humans for battery power. Then, there are the reports that AI beats the best Go and Chess players in the world, […]]]>

Many predictions about the future of humans and artificial intelligence take an either/or approach.

Skynet is determined to end the human race in The Terminator. In The Matrix, the machines have learned to farm humans for battery power. Then, there are the reports that AI beats the best Go and Chess players in the world, leaving humans no shot at victory.

These views make it seem that in order to win, humans must exist free of computer control. A more likely outcome—a better picture of success—is to say, “We found our peace through AI/human augmentation.”

AI augmentation is a view that sees the story of humans and machines as one of cooperation. It puts the human in the driver’s seat and focuses on how AI assists in enhancing human capabilities, like hearing, seeing, and making decisions. Communication is an important factor in AI/human augmentation. Communication can happen through sensors, human-in-the-loop learning, and user surveys. Finally, our ability to create, detect, and communicate with these AIs depends on how we believe they work.

AI/human augmentation is the most plausible look at the future of human computer interactions and deserves more attention.

What is human augmentation?

“Computers are like a bicycle for our minds.”–Steve Jobs

Computers were a leap in tool-building that made people far more capable than before. Computers allowed people to:

  1. Solve a problem once.
  2. Write lines of code to perform it.
  3. Let a computer perform the task over and over.

The development of artificial intelligence (AI) has inserted itself as a prologue to these steps in problem solving. It is in this step of human decision-making that AI is having its profound impact.

AI allows a developer to solve how decisions get made around a class of problems. Before, when a similar problem presented itself, a developer would have to sit down and write another program to tackle the new variation. With AI, the developer is tasked with defining the scope of the problem and then allowing the AI to decide what the outcome should be.

AI/human augmentation example

AI/human augmentation example

In a brief example, an engineer must meet the requirements to construct the design of a building. An engineer needs to know whether 2x6s or 2x12s can cover the span from wall to wall. If there is a second floor, can the joists support the weight of the people walking on the floor? If there is a roof load passed to the floor system, are the beams and joists large enough to support the weight passed to them?

Given the design, the engineer has options on how to best engineer the ceiling and roof system in order to:

  • Cost the least amount of money in timber costs
  • Continue to support the structure

If the engineer were a computer developer, however, perhaps they could create a program with a guided set of rules to engineer the house for them. The limitation in previous computer programs has been that, for each new house the engineer would build, they would have to create a new program to do the engineering of the house. Every new design is governed by the same rules (a 2×6 can only support so much weight—its variable is constant across houses).

Here is where the AI comes into play: AI lets the engineer write a piece of software that can do the engineering for all houses instead of having to write a new one each time a new design comes across their table.

Human AI collaboration

When the AI is created to make decisions around common problems, and used to assist in making decisions, the relationship between person and AI actually transforms who people are.

The very things people can accomplish changes. The day-to-day activities people perform changes. People are no longer set to fixate in one domain, solving the same problem over and over. Instead, people can:

  1. Travel across many domains.
  2. Understand a domain’s problem.
  3. Create an AI to tackle that problem.

Benefits of AI (Does AI destroy jobs—or do we?)

In the human-AI dance, aside from human extinction, the primary concern people have is that AI is going to eliminate many jobs.

I must say, jobs have always had a lifecycle. Airline pilots used to be a top-of-society, high-end job. Banking and finance took a top spot in the ‘80s. These things change based on:

  • Skill levels
  • Competition
  • Available technologies

Next, having old jobs persist into the future is not an appropriate standard by which to measure society.

If, suddenly, there were no more oil jobs because everyone worked at nuclear plants—because that is the energy in demand—that is not a bad thing. It is an adjustment to the times. If there were still floors of telephone switchboard operators in existence in today’s labor market, it’d be easy to wonder, “Why is that company still doing things that way?”

Does AI destroy jobs

Instead of being concerned about AI eliminating jobs, the real questions we ask should be:

  • Are there enough new jobs for people to get into?
  • Are training and education easily accessible and affordable?
  • Is there plenty of investment money in circulation to take risks on new industries and new ideas that lead to creating places for new workers to work?

The effect AI/human augmentation has on the market is, of course, eliminating old ones—but that is not the only thing, nor is that a bad thing.

AI enables focused and meaningful work

AI’s entering the marketplace will also remove the menial, day-to-day tasks and let people work on the things that matter.

It can remove the work nobody looks forward to doing. Whether it is a communication headache or a pile of minute, detailed work done alone in a dark room just so food can be put on the table, AI can do it.

People can use AI to eliminate the dull tasks and place greater human attention on the things that matter. People can perfect their roles in certain areas. It means the elimination of simple tasks and more quality time spent doing the things that matter. Some tasks will be gone, but new ones will also be created.

In photo or video editing, for instance, instead of spending hours editing a photo to include a person who was not present in the original photo, the editing can just be done by using the Photoshop technology with perfect blurring, lighting, and blending techniques. Honestly, you might not even know there is AI being used in the background.

The features and the things that you are capable of will increase—and they’ll increase in complexity. (But only for those who knew how it used to be done. Young eyes get to start with this tech and see how it transforms over time.)

The photo’s creator just has to give direction to what exactly they want. The photographer can spend more time trying to communicate their message, and less time doing the actual engineering.

AI enables people as creative curators

If an AI can fill in a lot of the processing steps, people have more time to iterate over different designs before selecting their final one.

In architecture, the design process might work similar to photos. The designers can say, we want walls here, here, and here. Then the AI, which can do all the structural engineering to ensure the building stands and passes city codes, can figure out how to construct the layout. If it is not liked, the designers can undo it, and try another variation.

In a perfect setup, people’s roles can shift from doing the grunt work to actually designing. Industries where people’s roles turn into creative curators:

  • Architecture
  • Painting
  • Photography
  • Music
  • Writing
  • Coding

Perfect elimination of the grunt work is a look at the ideal, but there will always be some form of new thing. As people relieve themselves from one challenging endeavor, they start using their newly freed time to push the boundaries of the new technology, ultimately creating a new problem.

Animation has been happening through various technological forms for 100+ years, and, even with computers and AI used to help, there are now more people working on a team to create an animated film than there were 100 years ago.

Human augmentation examples

Human/AI augmentation in education

I think one opportunity where AI/human augmentation can present a clear and present shift is with teachers. I am the son of a high school teacher, and I only speak about this from the heart.

Like how an AI can make better moves across a chess board to defeat an opponent, an AI might become the best at selecting the “when and how” to dispense knowledge to educate a student. The AI could do the laborious tasks:

  • Creating a curriculum
  • Designing a syllabus
  • Grading tests

The value of using an AI is to define the problem space and allow the AI to create many unique solutions. That means an AI can create a unique educational path for each student—which is far beyond the capabilities of any educational institution today.

The best organizational structure schools have found, thus far, is to create two or three different paths with a base-level path, then one or two accelerated path options (e.g., AP and IB classes). More variation comes in the extracurriculars students get themselves into.

For education, this use of AI actually sounds great: Students can take a path suited to them, driven by their curiosity, and move down it at their own pace.

With AI in the mix, the teacher’s role is necessary because the presentation of new information is always met with resistance by a learner, and students handle the moment in various ways. The teacher becomes the students’ guide to instill discipline when approaching new topics. They:

  • Encourage students to persist through the struggle
  • Act as a sounding board for feedback

And, with their lesson planning offloaded, teachers have the time to tend to this core responsibility. Teachers act as guides to knowledge acquisition, not dispensers of knowledge.

AI supports focused problem solving

The clear definition of success makes the use of AI in education possible. Papers are graded, and, by the end, the goal is to have a higher-level understanding of the maths, sciences, and the humanities.

Outside the educational world, however, once a person enters life—“the real world”—the path to success is not so clearly defined. It is not agreed upon, it is not measured by passing grades, nor is there another level of achievement to reach.

In life, it is harder, maybe even impossible, to define what a “win” might be, yet we continue attempting to define the game. For example, there are paths through college, careers, family planning, homeownership, and probably… I have lost a few people already in what qualifies as a successful path through life.

What is a successful life? This is far too broad of a problem for AI to solve. And it’s likely the wrong way of thinking about this.

Because AIs have limitations, the kinds of applications that come from AI/human augmentation are therefore limited. An AI cannot assist with broad problems like designing choices through life, but it can assist with narrow scopes like:

  • Improving your grammar
  • Making your choices on a chess board better
  • Knowing when a person has a grasp on algebra, and what math comes next

AI-Human augmentation: A step towards AGI

Artificial General Intelligence (AGI) is the holy grail of the AI world right now. AGI is built around the idea that a single AI can have the general intelligence of a human. But, because of how varied the human intellect is, and how fragmented all data collection and uses are, at present, it is infeasible to build an AGI.

The success of an AGI is dependent on:

  • The number of points of contact between AI and human
  • The kinds of data which passes between the two

It only makes sense, then, that AI/human augmentation is a necessary step towards building an AGI because it creates more points of contact through the apps that get created, and the data flow will continue to increase as the complexity of the interfaces increases. Whether an AGI is the goal or not, AI applications will continue to be developed to enhance human ability—which only helps the cause of AGI and makes its outcome seem certain.

  • AGI requires more data to map human behaviors and decision-making.
  • AI/human interfaces provide exactly this.

So, AGI is going to be built at some point. Whether there exists any fear over this depends on how you choose to grant an entity the right to communicate and engage with negotiation, and, in a similar vein, but from another angle, which devil you choose to assume exists.

The science of communication

AI isn’t the problem. The problem is how people design and use it. A badly designed AGI can be frustratingly incompetent or even tyrannical.

Norbert Wiener defines the science of communication between machines and living things in his book The Human Use of Human Beings (1950). He articulates how communication occurs and how machines learn, the potential impacts on culture, and the potential good and bad outcomes.

The science of communication

The intent of a well-designed and useful AGI is to eliminate confusion and willful malice on the part of the machine. The ideal is “the AI and human relationship should be one of clarity, utility, and harmony.”

(See how people harness AI for cyberattacks.)

When is AI too far? Challenges of AI

In an AI-assisted world, through the view of augmented intelligence, it is not AI or AGI which should cause fear. Likely the “willful malice” that could get baked into an AI would be built by a person, and not an outcome determined by the AI itself.

Caution, then, is not of the technology itself, but of people who do not know when to stop. The caution is about those who are trying to build the path to God. Those who wish to extend far beyond their own limitations and make the world a better place. They think they can tread on old grounds, people’s homes, and show a better way.

As much as computer technology is advancing, so too can progress be made in cultural and societal technologies so people grow an understanding for:

  • Who and what they are
  • How to partake in societies
  • What their limits are
  • Where their freedoms lie

No matter how great or successful a person, or a people, becomes, they must always operate in time. Time is the great equalizer. And, in the great song of life, as much as you think you can pull people along, and accelerate their lives and their situations, you can only rush the beat.

AGI is not the threat, but people who think their technological answer is everyone’s answer run the risk of destroying the harmonies of the song. Ensembles can gather to a new beat once multiple locations recognize where the beat should be, and more and more people dance to its pulse.

AI presents its own challenges to operate. AI can work with humans to accomplish impressive feats, but is not a perfectly running machine. It has all its own maintenance requirements.

In the new, booming AI industry, new jobs arrive with both low-skill and high-skill labor options. The upkeep requirements to build and maintain a good AI in turn create other jobs.

Challenges of AI

(Learn more about data annotation, technical debt & performance monitoring.)

Related reading

]]>
What Is Cassandra? Key Features and Advantages https://www.bmc.com/blogs/apache-cassandra-introduction/ Tue, 04 Feb 2025 00:00:26 +0000 https://www.bmc.com/blogs/?p=17453 To fully appreciate Apache Cassandra and what it can do, it’s helpful to first understand NoSQL databases and to then look more specifically at Cassandra’s architecture and capabilities. Doing so provides a good introduction to the system, so you can determine if it’s right for your business. (This article is part of our Cassandra Guide. […]]]>

To fully appreciate Apache Cassandra and what it can do, it’s helpful to first understand NoSQL databases and to then look more specifically at Cassandra’s architecture and capabilities. Doing so provides a good introduction to the system, so you can determine if it’s right for your business.

(This article is part of our Cassandra Guide. Use the right-hand menu to navigate.)

What is Apache Cassandra?

Apache Cassandra is a distributed database management system that is built to handle large amounts of data across multiple data centers and the cloud. Key features include continuous availability with no single point of failure, horizontal scalability, and fast writes at massive scale.

Written in Java, it’s a NoSQL database offering many things that other NoSQL and relational databases cannot.

Cassandra was originally developed at Facebook for their inbox search feature. Facebook open-sourced it in 2008, and Cassandra became part of the Apache Incubator in 2009. Since early 2010, it has been a top-level Apache project. It’s currently a key part of the Apache Software Foundation and can be used by anyone wanting to benefit from it.

Cassandra stands out among database systems and offers some advantages over other systems. Its ability to handle high volumes makes it particularly beneficial for major corporations. As a result, it’s currently being used by many large businesses including Apple, Facebook, Instagram, Uber, Spotify, Twitter, Cisco, Rackspace, eBay, and Netflix.

What is a NoSQL Database?

A NoSQL database, often referred to as “not only SQL,” is one that stores and retrieves data without requiring the data to be stored in tabular format. Unlike relational databases, which require a tabular format, NoSQL databases allow for unstructured data. This type of database offers:

  • A simple design
  • Horizontal scaling
  • Extensive control over availability

NoSQL databases do not require a fixed schema, allowing for easy replication. With its simple API, I like Cassandra for its overall consistency and its ability to handle large amounts of data.

That said, there are pros and cons of using this type of database. While NoSQL databases offer many benefits, they also have drawbacks. Generally, NoSQL databases:

  • Only support simple query languages
  • Are just “eventually consistent”
  • Don’t support transactions

Nevertheless, they are effective with huge amounts of data and offer easy, horizontal scaling, making this type of system a good fit for many large businesses. Some of the most popular and effective NoSQL databases include:

  • Apache Cassandra
  • Apache HBase
  • MongoDB

What makes Apache Cassandra database unique?

Cassandra is one of the most efficient and widely-used NoSQL databases. One of the key benefits of this system is that it offers highly-available service and no single point of failure. This is key for businesses that cannot afford to have their system go down or to lose data. With no single point of failure, it offers truly consistent access and availability.

Another key benefit of Cassandra DB is the massive volume of data that the system can handle. It can effectively and efficiently handle huge amounts of data across multiple servers. Plus, it is able to write huge amounts of data quickly without affecting read efficiency. Cassandra offers users “blazingly fast writes,” and neither speed nor accuracy suffers with large volumes of data. It is just as fast and as accurate for large volumes of data as it is for smaller volumes.

Another reason that so many enterprises utilize Cassandra DB is its horizontal scalability. Its structure allows users to meet sudden increases in demand, as they can simply add more hardware to accommodate additional customers and data. This makes it easy to scale without shutdowns or major adjustments. Additionally, its linear scalability is one of the things that helps to maintain the system’s quick response time.

Some other benefits of Cassandra include:

  • Flexible data storage. Cassandra can handle structured, semi-structured, and unstructured data, giving users flexibility with data storage.
  • Flexible data distribution. Cassandra uses multiple data centers, which allows for easy data distribution wherever or whenever needed.
  • Support for ACID properties. Cassandra provides the properties of ACID (atomicity, consistency, isolation, and durability) in a limited form, for example through atomic, isolated, and durable writes at the partition level and tunable consistency, rather than full multi-row transactions.

Clearly, Apache Cassandra offers some discrete benefits that other NoSQL and relational databases cannot. With continuous availability, operational simplicity, easy data distribution across multiple data centers, and an ability to handle massive amounts of volume, it is the database of choice for many enterprises.

How does Cassandra work?

Apache Cassandra is a peer-to-peer system. Its distribution design is modeled on Amazon’s Dynamo, and its data model is based on Google’s Bigtable.

The basic architecture consists of a cluster of nodes, any and all of which can accept a read or write request. This is a key aspect of its architecture, as there are no master nodes. Instead, all nodes communicate equally.

While nodes are the specific location where data lives on a cluster, the cluster is the complete set of data centers where all data is stored for processing. Related nodes are grouped together in data centers. This type of structure is built for scalability and when additional space is needed, nodes can simply be added. The result is that the system is easy to expand, built for volume, and made to handle concurrent users across an entire system.
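
Because every node is a peer, a client driver can be pointed at any subset of nodes and will discover the rest of the cluster. The sketch below uses the DataStax Python driver (the cassandra-driver package) with hypothetical contact points.

from cassandra.cluster import Cluster  # assumes the cassandra-driver package is installed

# Hypothetical contact points; the driver discovers the remaining peers.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect()

# Any node that receives this request can coordinate the read.
row = session.execute("SELECT cluster_name, release_version FROM system.local").one()
print(row.cluster_name, row.release_version)

cluster.shutdown()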

Its structure also allows for data protection. To help ensure data integrity, Cassandra has a commit log. This is a backup method and all data is written to the commit log to ensure data is not lost. The data is then indexed and written to a memtable. The memtable is simply a data structure in the memory where Cassandra writes. There is one active memtable per table.

When memtables reach their threshold, they are flushed on a disk and become immutable SSTables. More simply, this means that when the commit log is full, it triggers a flush where the contents of memtables are written to SSTables. The commit log is an important aspect of Cassandra’s architecture because it offers a failsafe method to protect data and to provide data integrity.

Who should use Cassandra?

If you need to store and manage large amounts of data across many servers, Cassandra could be a good solution for your business. It’s ideal for businesses that:

  • Can’t afford for data to be lost
  • Can’t have their database down due to the outage of a single server

It’s also easy to use and easy to scale, making it ideal for businesses that are consistently growing.

At its core, Apache Cassandra is “built for scale” and can handle large amounts of data and concurrent users across a system. You can store massive amounts of data in a decentralized system, yet it still allows users to have control and access to their data.

Data is always accessible in Cassandra. With no single point of failure, the system offers true continuous availability, avoiding downtime and data loss. It can be scaled by simply adding new nodes, so there is constant uptime and no need to shut the system down to accommodate more customers or more data. Given these benefits, it’s not surprising that so many major companies use Apache Cassandra software.

What do you use Apache Cassandra for?

When you need to handle large amounts of data and must have dependable and fast access to it with a system that can massively scale, Cassandra’s fault tolerance and high availability, with global scalability, are the answer. Here are some examples of common use scenarios and applications:

e-Commerce

Cassandra supports vital retail functions, from managing catalogs and shopping carts to inventory management. Customer expectations are high and meeting them is where Cassandra shines. Cassandra ensures zero downtime, fast responsiveness, scalability, and powerful analytics.

Entertainment websites

Websites like Netflix and Spotify are examples of global entertainment sites that use Cassandra. Cassandra empowers sites like these to serve millions of concurrent users with massive amounts of streaming data, user profiles, and viewing history. Cassandra also feeds data into recommendation engines, enhancing user experiences.

Internet of Things (IoT) and edge computing

IoT devices create massive and fast-changing data sets that require a flexible database. Cassandra data tiering can handle “hot” data that is fresh, the summaries and statistics that make up “warm” data, and older “cold” data that might be used for managing maintenance. New nodes can be added without any downtime.

Authentication and fraud detection

To effectively detect security threats, Cassandra makes it possible to analyze large, heterogeneous datasets in real-time to uncover patterns and breaks in patterns that may flag potentially fraudulent behavior. It also supports fast user authentication, without complexity or friction.

Messaging

Cassandra facilitates the sending and receiving of messaging at scale with real-time performance, scaling to handle heavy loads and replicating messages for easier re-routing around outages. It also supports storing conversations, threads, and metadata around messages and conversations.

Logistics and asset management

Whether it is tracking packages, containers, vehicles, or storage locations, Cassandra scales to handle the details of logistics operations without disruptive downtime. Tracking assets, routes, deliveries, inventory levels, and even adding additional fields for scans and sensors, is straightforward.

Limitations of Cassandra

Despite the power and obvious advantages of Cassandra, organizations do run into challenges in implementing and using it.

High maintenance costs

Apache Cassandra is open source, and thus costs nothing to deploy, but ongoing development and maintenance take time, talent, and money. Rather than wait for the development community to extend a feature or fix a bug, you may have to invest in those changes yourself, especially if you have service level agreements to meet.

Risks around security, regulatory compliance, and governance

Cassandra offers some security features, but they may not be adequate for your environment or the global compliance requirements across your markets. You may need to invest in additional capabilities and layers, particularly if you are operating in highly regulated industries and activities.

Patchwork of support and services

Your applications are likely to have been developed by a mix of open source, third-party vendors, and internal resources, each with differing levels of expertise and availability, so implementation and maintenance can feel ad hoc and disjointed.

Finding expertise

Talent with expertise in Cassandra is in high demand, and there’s a limited supply of people with sufficient knowledge. People who want to gain that expertise read through open source documentation of varying levels of quality, seek help from community boards, and invest in time-consuming trial and error. Without the right partner, it may be difficult to get full value from a Cassandra deployment.


Gartner’s AI Maturity Model: Maximize Your Business Impact https://www.bmc.com/blogs/ai-maturity-models/ Thu, 30 Jan 2025 00:00:13 +0000 https://www.bmc.com/blogs/?p=18011

AI stands for artificial intelligence, and assessing its value is notoriously tricky for many reasons. It helps to think of AI not as artificial intelligence but as Advanced Information Processing. It is the same AI everyone has come to know, but when we treat it as information processing, the technology becomes a tool rather than another thinking species. With tools, we gain the agency to create new things.

In this article, I’m exploring how companies use AI in different stages—the AI maturity model. I also look at problems that AI can solve and ways to adopt AI, so you can mature your AI strategy and application.

AI adds business value

Machine learning algorithms process data in ways that were previously impossible. The mathematics has existed for decades, but a lack of data and compute power made it impractical.

Today, of course, that’s changed. Machine learning makes it worthwhile to collect enormous amounts of additional data because it can both:

  • Surface relationships across vast numbers of data points (previously impractical)
  • Locate specific pieces of data without an engineer having to build decision trees to get there

Advanced information processing, the framing that helps us assess AI strategically, brings new value to companies. Different companies will use the tooling differently, and not every company can use it to the same degree. AI holds greater value for companies that manage a lot of information and less for those that handle little.

That’s why companies fall into different stages of AI adoption. Learn more about how Artificial Intelligence for IT Operations is making waves in streamlining IT operations for organizations worldwide.

What is a maturity model?

A maturity model is a framework that organizations can use to assess their capabilities in a specific area. It describes phases or stages of a capability’s development, efficiency, or sophistication. Companies use maturity models to plan investments, improve processes and areas of competence, and make strategic changes.

Every model is different, but in general, most follow these typical levels:

  1. Initial: Processes are not defined. Activities are not planned, being ad hoc and reactive. Capabilities are undefined. Without controls, results are unpredictable.
  2. Repeatable: Basic project management is in place. Processes are starting to emerge and key performance indicators (KPIs) are used. Outcomes are identified, with predictable and consistent results achieved.
  3. Defined: Documented and standardized processes are aligned with organizational goals. Team members get training, have guidelines, and work toward metrics to ensure consistent results.
  4. Managed: Measurements and controls are fully developed with a data-driven approach for continuous improvement.
  5. Optimized: Processes are well-defined, tested, and efficient. The focus is on innovation, resilience, and adaptability.

Industries that have adopted the maturity model approach include technology and IT, healthcare, manufacturing, and financial services.

AI maturity model

Companies use AI in different ways. Gartner has released an AI maturity model that segments organizations into five levels of maturity based on their use of AI.


Most companies today fall under Level 1, Awareness, their businesses benefiting only mildly from AI. Few companies are in Level 5, and fewer still are both ready and have the capacity to integrate AI into every one of their processes.

At each stage in Gartner’s maturity model, a company takes a different approach to AI:

Level 1. Awareness

Companies in this stage know about AI but haven’t quite used it yet. These companies may be excited to implement AI, but they often speak more of it than they know. They formulate ideas, but not strategies, for how to use AI in their businesses.

Level 2. Active

These companies are playing with AI informally. They are experimenting with AI in Jupyter notebooks, and they may have incorporated a few models from a library like TF.js into their processes.

Level 3. Operational

These companies have adopted machine learning into their day-to-day functions. They likely have a team of ML engineers who maintain models, build data pipelines, and version data. They have the ML infrastructure set up, and they are using ML to assist with some information processing tasks, hence the Advanced Information Processing approach to measuring value.

Level 4. Systemic

These companies are using machine learning in novel ways to disrupt business models. Hype at the Awareness stage may also claim disruption, but the difference between a Level 1 and a Level 4 company is that the Level 4 company has feet on the ground, with the ML infrastructure already in place.

Level 5. Transformational

Companies at this level of Gartner’s AI maturity model use ML pervasively. Machine learning and information processing are the value offering to their customers.

Companies in this stage rely on AI to do significant heavy lifting for the business. Google is an information processing company. Facebook ranks status posts and advertisements. Amazon, Netflix, and Yelp recommend products, movies, and restaurants to users. All of these companies use machine learning to tweak their algorithms, adjust their product offerings, and optimize their systems infrastructure (for example, Netflix tuning for low latency in specific time zones).

How to adopt AI to increase business value

Adopting AI shouldn’t be a move organizations take on just because it’s buzzy. Any AI adoption should provide strategic, measured value to the business. Questions to ask:

  • What decisions does my company make?
  • What data does my company collect?

AI is useful when making data-driven decisions. Your company must be able to do this single thing very, very well if it wishes to get the most out of AI.


Adopting AI is an ongoing two-step process: exploring current practices and brainstorming new decisions that increase your company’s value. AI works when data has been collected and a decision needs to be made. If you have the data, what decisions can be made from it? If you make decisions, what data is necessary to make them? Finally, how can you start looking at every daily activity as a data-driven decision?

See how BMC Service Management now offers Generative AI Solutions to help enhance the AI maturity of your organization.

1. Explore current business practices

Begin by exploring the business for areas where it can use AI. Even if a task seems menial and so easy that it doesn’t appear worth automating, automation is more valuable in the long term.

When a person doesn’t have to make the same decision over and over, they are mentally freed to make other decisions. Over time, practicing automation allows the same team to run many different processes, with continued growth, instead of maxing out their mental capacity to solve the same problem over and over. Without automation, the only option for growth is hiring more team members.

2. Brainstorm new decisions that make your company more valuable

After current business processes are examined, it’s time to begin thinking about new ways the organization can conduct its business. As you figure this out, the center of attention, all of your attention, should be data-driven decision-making.

AI adoption case study

For example, look at your sales team. AI can help a sales team segment their customers into types. A sales agent benefits by knowing a customer’s type, which makes it easier to negotiate a deal or offer the appropriate services. AI can take in data like customer age, email, location, purchasing habits, and usage habits, and use something like a K-means clustering algorithm to place the users into types. Often, as the story goes, the AI is better at doing this than the sales agent. Talent scouts in Moneyball thought they knew a good player when they saw one, but, when it came to the data, the algorithms did it better.
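Here is a minimal sketch of that segmentation idea using scikit-learn’s KMeans. The customer features, the synthetic data, and the choice of three segments are invented for illustration.

```python
# Minimal customer-segmentation sketch with scikit-learn's KMeans.
# Features and data are synthetic; three segments chosen arbitrarily.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: age, purchases per month, average order value.
customers = np.array([
    [23, 1, 20], [25, 2, 25], [31, 1, 18],     # occasional, low spend
    [45, 8, 120], [48, 9, 110], [52, 7, 140],  # frequent, high spend
    [35, 4, 60], [38, 5, 55], [41, 4, 70],     # middle of the road
])

X = StandardScaler().fit_transform(customers)           # put features on one scale
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for segment, row in zip(kmeans.labels_, customers):
    print(f"segment {segment}: age={row[0]}, purchases/mo={row[1]}, avg order=${row[2]}")
```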

Not only can the AI do this better, but the sales team’s time is freed for something else. With the AI assisting in type recognition, the team no longer has to feel out each customer’s type through pleasantries and can put its mental capacity toward higher-level tasks, like refining the product offering for a particular type. When that task can be automated in turn, the team moves its attention to the next one, and so on.

Sales teams, through the use of AI, get to become developers. They stack process on top of process to allow AI to become their personal Sales Agent Assistant. Get lazy. Allow software to do more and more of the work.

AI helps your organization make decisions

AI can help any organization make decisions. Here are some examples:

  • Rank a list of items. For example, one person’s status is more valuable than another. Or, this particular color is better liked than another.
  • Recommend items. If this person likes X, they will also like Y. (See the sketch after this list.)
  • Discover anomalies and make predictions. Detect which supply chain is operating inefficiently. Identify which customers are most likely to come back, or which are most likely to want another of your company’s products.
  • Sort populations into types. Segment your population of customers into which ones are top-users, occasional users, and one-time users. Put your population into personalities: risk-averse or risk-seeking. Combine product features into sets to refine the product offering: The base plan comes with these 5 features, the intermediate comes with these 8 features, and the advanced plan comes with these 15 features.
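As a hedged illustration of the “recommend items” idea above, here is a minimal item-to-item similarity sketch in plain NumPy. The items, the tiny user-item matrix, and the 1/0 “liked” signals are all invented.

```python
# Minimal "people who like X also like Y" sketch using item-item cosine similarity.
# The items and the user-item matrix (1 = liked, 0 = no signal) are invented.
import numpy as np

items = ["base plan", "add-on A", "add-on B", "premium support"]
ratings = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
])

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
similarity = (ratings.T @ ratings) / np.outer(norms, norms)
np.fill_diagonal(similarity, 0)  # ignore an item's similarity to itself

liked = "base plan"
best_match = items[int(np.argmax(similarity[items.index(liked)]))]
print(f"Users who like '{liked}' may also like '{best_match}'")
```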

Additional resources

For more on this topic, browse the BMC Machine Learning & Big Data Blog.
