Channel: NoSQL – Cloud Data Architect

Shows w/ MySQL this week

Feed: Planet MySQL
Author: Oracle MySQL Group

Just a friendly reminder about the busy week we have ahead of us… Please find below the shows where you can meet our MySQL staff:

  • The first show where you can find us is Oracle Code 2019 in Shenzhen, China, on April 16, 2019
    • JSON is one of the most flexible data formats for data exchange and storage today. The MySQL X DevAPI introduces a modern and easy-to-learn way to work with JSON and relational data.
    • Do not miss the MySQL session scheduled for 4:05 pm – 4:50 pm:
      • “NoSQL @ MySQL – Managing JSON Data with reliable and secured MySQL Database” by Ivan Ma, from Oracle MySQL Team & Zhou Yin Wei – Oracle MySQL ACE Director
  • The second show is Open Source 101 in Columbia, SC, US, on April 18, 2019
    • MySQL and the Oracle Back-end IT Solutions group are attending this show together. Find us at the shared booth in the expo area, and do not miss the following MySQL talk:
      • “MySQL 8.0 New Features” by David Stokes, the MySQL Community Manager (Apr 16@1:30-2:15pm, 1C Conference Room)
  • The last conference with MySQL this week is the Open Source Conference Okinawa, Japan, on April 20, 2019
    • Do not miss the MySQL talk focused on the newest trends in MySQL development, including a demonstration of MySQL Document Store with a Java app. The talk is given by Yoshiaki Yamasaki from the MySQL GBU.
    • Come talk to us at the MySQL booth in the expo area as well!


Announcing general availability of Apache Hadoop 3.0 on Azure HDInsight

Feed: Microsoft Azure Blog.
Author: Arindam Chatterjee.

Today we’re announcing the general availability of Apache Hadoop 3.0 on Azure HDInsight. Microsoft Azure is the first cloud provider to offer customers the benefit of the latest innovations in the most popular open source analytics projects, with unmatched scalability, flexibility, and security. With the general availability of Apache Hadoop 3.0 on Azure HDInsight, we are building upon existing capabilities with a number of key enhancements that further improve performance and security, and deepen support for the rich ecosystem of big data analytics applications.

Bringing Apache Hadoop 3.0 and supercharged performance to the cloud

Apache Hadoop 3.0 represents over 5 years of major upgrades contributed by the open source community across key Apache frameworks such as Hive, Spark, and HBase. New features in Hadoop 3.0 provide significant improvements to performance, scalability, and availability, reducing total cost of ownership and accelerating time-to-value.

  • Apache Hive 3.0 – With ACID transactions on by default and several performance improvements, this latest version of Hive enables developers to build “traditional database” applications on massive data lakes. This is particularly important for enterprises who need to build GDPR/privacy compliant big data applications.
  • Hive Warehouse Connector for Apache Spark – With the Hive Warehouse Connector, the Spark and Hive worlds are coming closer together. The new connector moves the integration from the metastore layer to the query engine layer. This enables better, more reliable performance with predicate pushdown and other functionality (a usage sketch follows this list).
  • Apache HBase 2.0 and Apache Phoenix 5.0 – Apache HBase 2.0 and Apache Phoenix 5.0 introduce a number of performance, stability, and integration improvements. With HBase 2.0, periodic reorganization of the data in the memstore with in-memory compactions improves performance as data is not flushed or read too often from remote cloud storage. Phoenix 5.0 brings more visibility into queries with query log by introducing a new system table that captures information about queries that are being run against the cluster.
  • Spark IO Cache – IO Cache is a data caching service for Azure HDInsight that improves the performance of Apache Spark jobs. IO Cache also works with Apache TEZ and Apache Hive workloads, which can be run on Apache Spark clusters.
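
The announcement itself contains no code, so here is a hedged PySpark sketch of the Hive Warehouse Connector pattern referenced above. The pyspark_llap module path, table name, and query are assumptions drawn from the connector’s general usage pattern, not from this post, and may differ by HDInsight version.

# Sketch: query a Hive 3 managed (ACID) table from Spark through the
# Hive Warehouse Connector, so the query runs in Hive's engine instead of
# Spark reading warehouse files directly. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-sketch").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

df = hive.executeQuery("SELECT id, amount FROM sales WHERE amount > 100")
df.show()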

Enhanced enterprise grade security

Enterprise grade security and compliance is a critical requirement for all customers building big data applications that store or process sensitive data in the cloud.

  • Enterprise Security Package (ESP) support for Apache HBase – With the general availability of ESP support for HBase, customers can ensure that users authenticate to their HDInsight HBase clusters using their corporate domain credentials and are subject to rich, fine-grained access policies (authored and managed in Apache Ranger).
  • Bring Your Own Key (BYOK) support for Apache Kafka – Customers can now bring their own encryption keys into the Azure Key Vault and use them to encrypt the Azure Managed Disks storing their Apache Kafka messages. This gives them a high degree of control over the security of their data.

Rich developer tooling

Azure HDInsight offers rich development experiences with various integrated development environment (IDE) extensions, notebooks, and SDKs.

  • SDKs general availability – HDInsight SDKs for .NET, Python, and Java enable developers to easily manage clusters using the language of their choice (see the Python sketch after this list).
  • VS Code – The HDInsight extension for Visual Studio Code enables developers to submit Hive batch jobs, interactive Hive queries, and PySpark scripts to HDInsight 4.0 clusters.
  • IntelliJ – The Azure Toolkit for IntelliJ enables Scala and Java developers to program Spark, Scala, and Java projects with built-in templates. Developers can easily perform local run, local debug, open interactive sessions, and submit Scala/Java projects to HDInsight 4.0 Spark clusters directly from the IntelliJ integrated development environment.
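
As a small illustration of the Python SDK mentioned above, the sketch below lists the clusters in a subscription. It follows the azure-mgmt-hdinsight package as published around the time of this announcement; the credential values are placeholders and the exact model attributes may differ between SDK versions.

# Sketch: enumerate HDInsight clusters with the Python management SDK
# (pip install azure-mgmt-hdinsight). All identifiers below are placeholders.
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.hdinsight import HDInsightManagementClient

credentials = ServicePrincipalCredentials(
    client_id="<app-id>",
    secret="<app-secret>",
    tenant="<tenant-id>",
)
client = HDInsightManagementClient(credentials, "<subscription-id>")

# Print each cluster the service principal can see, with its current state.
for cluster in client.clusters.list():
    print(cluster.name, cluster.properties.cluster_state)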

Broad application ecosystem

Azure HDInsight supports a vibrant application ecosystem with a variety of popular big data applications available on Azure Marketplace, covering scenarios from interactive analytics to application migration. We are excited to support applications such as:

  • Starburst (Presto) – Presto is an open source, fast, and scalable distributed SQL query engine that allows you to analyze data anywhere within your organization. Architected for the separation of storage and compute, Presto can easily query data in Azure Blob Storage, Azure Data Lake Storage, SQL and NoSQL databases, and other data sources. Learn more and explore Starburst Presto on Azure Marketplace.
  • Kyligence – Kyligence is an enterprise online analytic processing (OLAP) engine for big data, powered by Apache Kylin. Kyligence enables self-service, interactive business analytics on Azure, achieving sub-second query latencies on trillions of records and seamlessly integrating existing Hadoop and BI systems. Learn more and explore Kyligence on Azure Marketplace.
  • WANDisco – WANDisco Fusion de-risks migration to the cloud by ensuring disruption-free data migrations, easy and seamless extensions of Spark and Hadoop deployments, and short or long term hybrid data operations. Learn more and explore WANDisco on Azure Marketplace.
  • Unravel Data – Unravel provides a unified view across your entire data stack, providing actionable recommendations and automation for tuning, troubleshooting, and improving performance. The Unravel Data app uses Azure Resource Manager, allowing customers to connect Unravel to a new or existing HDInsight cluster with one click. Learn more and explore Unravel on Azure Marketplace.
  • Waterline Data – With Waterline Data Catalog and HDInsight, customers can easily discover, organize, and govern their data, all at the global scale of Azure. Learn more and explore Waterline on Azure Marketplace.

Get started now

We look forward to seeing what innovations you will bring to your users and customers with Azure HDInsight. Read the developer guide and follow the quick start guide to learn more about implementing open source analytics pipelines on Azure HDInsight. Stay up-to-date on the latest Azure HDInsight news and exciting features coming in the near future by following us on Twitter (#AzureHDInsight). For questions and feedback, please reach out to AskHDInsight@microsoft.com.

About Azure HDInsight

Azure HDInsight is an enterprise-ready service for open source analytics that enables customers to easily run popular Apache open source frameworks including Apache Hadoop, Spark, Kafka, and others. The service is available in 30 public regions and Azure Government Clouds in the US and Germany. Azure HDInsight powers mission critical applications for a wide range of sectors and use cases including ETL, streaming, and interactive querying.

Webinar: How Kafka and MemSQL Deliver Intelligent Real-Time Applications

Feed: MemSQL Blog.
Author: Floyd Smith.

Using Apache Kafka and MemSQL together makes it much easier to create and deliver intelligent, real-time applications. In a live webinar, which you can view here, MemSQL’s Alec Powell discusses the value that Kafka and MemSQL each bring to the table, shows reference architectures for solving common data management problems, and demonstrates how to implement real-time data pipelines with Kafka and MemSQL.

Kafka is an open source messaging queue that works on a publish-subscribe model. It’s distributed (like MemSQL) and durable. Kafka can serve as a source of truth for data across your organization.
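
For readers who have not used Kafka before, the publish-subscribe round trip looks roughly like the sketch below, using the kafka-python client; the broker address and topic name are placeholders, not details from the webinar.

# Minimal publish/subscribe sketch with kafka-python (pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

# Publish one message to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("transactions", b'{"account": 42, "amount": 10000}')
producer.flush()

# Subscribe and read it back; the consumer gives up after 5 seconds of silence.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value)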

What Kafka Does for Enterprise IT

Today, enterprise IT is held back by a few easy to identify, but seemingly hard to remedy, factors:

  • Slow data loading
  • Lengthy query execution
  • Limited user access

Kafka and MemSQL share complementary characteristics: both are distributed, durable, and built to scale.

These factors interact in a negative way. Limited data messaging and computing capabilities limit user access. The users who do get on suffer from slow data loading and lengthy query execution. Increasing organizational needs for data access – for reporting, business intelligence (BI) queries, apps, machine learning, and artificial intelligence – are either blocked, preventing progress, or allowed, further overloading the system and degrading performance.

Organizations try a number of fixes for these problems – offered by both existing and new vendors, usually with high price tags for solutions that add complexity and provide limited relief. Solutions include additional CPUs and memory, specialized hardware racks, pricey database add-ons, and caching tiers with limited data durability, weak SQL coverage, and high management costs and complexity.

NoSQL solutions offer fast ingest and scalability. However, they run queries slowly, consume scarce developer time on even basic query optimization, and break compatibility with BI tools.

How MemSQL and Kafka Work Together

MemSQL offers a new data architecture that solves these problems. Unlike NoSQL solutions, MemSQL offers both scalability – which affords extreme performance – and an easy-to-use SQL architecture. MemSQL is fully cloud-native; it is neither tied to just one or two cloud platforms, nor cloud-unfriendly, as with most alternatives.

MemSQL is a good citizen in all kinds of modern and legacy data management deployments.

In the webinar, Alec shows how MemSQL works. Running as a Linux daemon, MemSQL offers a fully distributed system, and is cloud-native – running in the cloud and on-premises, in containers or virtual machines, and integrating with a wide range of existing systems. Within a MemSQL cluster, an aggregator node communicates with the database client, manages schema, and shares work across leaf nodes. (A master aggregator serves as a front-end to multiple aggregator nodes, if the scale of the database requires it.)

MemSQL runs multiple aggregator and leaf nodes to distribute work fully across a cluster.

MemSQL Pipelines integrate tightly with Kafka, supporting the exactly-once semantics for which Kafka has long been well-known. (See the announcement blog post in The New Stack: Apache Kafka 1.0 Released Exactly Once.) MemSQL polls for changes, pulls in new data, and executes transactions atomically (and exactly once). Pipelines are mapped directly to MemSQL leaf nodes for maximum performance.

MemSQL Pipelines take data from Kafka, et al, optionally transform it, and store it - fast.
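
The webinar shows this in MemSQL’s own tooling, but the wiring can also be sketched from Python, since MemSQL speaks the MySQL wire protocol. The host, topic, table, and delimiter below are assumptions, and the exact CREATE PIPELINE options should be checked against the MemSQL documentation.

# Sketch: create and start a Kafka-backed MemSQL Pipeline using a standard
# MySQL-protocol client (pip install pymysql). All names are placeholders.
import pymysql

conn = pymysql.connect(host="memsql-master", user="root", password="", database="demo")
try:
    with conn.cursor() as cur:
        cur.execute("""
            CREATE PIPELINE tx_pipeline AS
            LOAD DATA KAFKA 'kafka-broker:9092/transactions'
            INTO TABLE transactions
            FIELDS TERMINATED BY ','
        """)
        cur.execute("START PIPELINE tx_pipeline")
    conn.commit()
finally:
    conn.close()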

Together, Kafka and MemSQL allow live loading of data, which is a widely needed, but rarely found capability. Used with Kafka, or in other infrastructure, MemSQL handles mixed workloads and meets tight SLAs for responsiveness – including with streaming data and strong demands for concurrency.

Kafka-MemSQL Q&A

There was a lively Q&A session. The questions and answers here include some that were handled in the webinar and some that could not be answered in the live webinar because of time constraints.

Q. Can Kafka and MemSQL run in the cloud?
A. Both Kafka and MemSQL are cloud-native software. Roughly half of MemSQL’s deployments today are in the cloud; for instance, MemSQL often ingests data from AWS S3, and has been used to replace Redshift. The cloud’s share of MemSQL deployments is expected to increase rapidly in the future.

Q. Can MemSQL replace Oracle?
A. Yes, very much so – and other legacy systems too. Because of the complexities of many data architectures, however, MemSQL is often used first to augment Oracle. For instance, customers will use a change data capture (CDC) to copy data processed by Oracle to MemSQL. Then, analytics run against MemSQL, offloading Oracle (so transactions run faster) and leveraging MemSQL’s faster performance, superior price-performance, scalability, and much greater concurrency support for analytics.

Q. How large can deployments be?
A. We have customers running from the hundreds of megabytes up into the petabytes.

Q. With MemSQL Pipelines, can we parse JSON records?
A. Yes, MemSQL has robust JSON support.

Summing Up Kafka+MemSQL and the Webinar

In summary, MemSQL offers live loading of batch and streaming data, fast queries, and fully scalable user access. Together, Kafka and MemSQL remove barriers to data streaming and data access right across your organization. You can view the webinar now. You can also try MemSQL for free today or contact us to learn how we can help support your implementation plans.

Real Time Analytics and Stream Processing

Feed: Featured Blog Posts – Data Science Central.
Author: Ajit Singh.

In many business scenarios it is no longer desirable to wait hours, days or weeks for the results of analytic processes. Psychologically, people expect real-time or near real-time responses from the systems they interact with. Real-time analytics is closely tied to infrastructure issues, and the recent move to technologies like in-memory databases is beginning to make ‘real-time’ look achievable in the business world and not just in the computer science laboratory.

Handling large amounts of streaming data, ranging from structured to unstructured, from numerical data to micro-blog streams, is challenging in a Big Data context because the data, besides its volume, is very heterogeneous and highly dynamic. It also calls for scalability and high throughput, since data collection related to a disaster area can easily occupy terabytes in binary GIS formats, and data streams can show bursts of gigabytes per minute.

The capabilities of existing systems to process such streaming information and answer queries in real time for thousands of concurrent users are limited. Approaches based on traditional solutions like Data Stream Management Systems (DSMS) and Complex Event Processors (CEP) are generally insufficient for the challenges posed by stream processing in a Big Data context: the analytical tasks required by stream processing are so knowledge-intensive that automated reasoning tasks are also needed.

The problem of effective and efficient processing of streams in a Big Data context is far from being solved, even when considering the recent breakthroughs in NoSQL databases and parallel processing technologies.
A holistic approach is needed for developing techniques, tools, and infrastructure which span the areas of inductive reasoning (machine learning), deductive reasoning (inference), high performance computing (parallelization) and statistical analysis, adapted to allow continuous querying over streams (i.e., on-line processing).
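
As a toy illustration of what continuous querying over a stream means in practice (not taken from the article), a sliding-window count over an unbounded event stream fits in a few lines of Python:

# Toy continuous query: a 60-second sliding count over an event stream.
from collections import deque

def sliding_count(events, window_seconds=60):
    """Yield (timestamp, number of events seen in the last window_seconds)."""
    window = deque()
    for ts, _payload in events:              # events: iterable of (timestamp, payload)
        window.append(ts)
        while window and window[0] < ts - window_seconds:
            window.popleft()                 # evict events that fell out of the window
        yield ts, len(window)

# Synthetic stream of (timestamp-in-seconds, payload) tuples.
stream = [(t, "event") for t in (0, 1, 2, 30, 61, 62, 150)]
for ts, count in sliding_count(stream):
    print(f"t={ts:>3}s events_in_window={count}")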

One of the most open Big Data technical challenges of primary industrial interest is the proper storage/processing/management of huge volumes of data streams. Some interesting academic/industrial approaches have started to mature in the last years, e.g., based on the Map Reduce model to provide a simple and partially automated way to parallelize stream processing over cluster or data centre computing/storage resources. However, Big Data stream processing often poses hard/soft real-time requirements for the identification of significant events, because detecting them with too high a latency could be completely useless.

New Big Data-specific parallelization techniques and (at least partially) automated distribution of tasks over clusters are crucial elements for effective stream processing. Achieving industrial grade products will require:

1. New techniques to associate quality preferences/requirements to different tasks and to their interworking relationships;

2. New frameworks and open APIs for the quality-aware distribution of stream processing tasks, with minimal development effort requested by application developers and domain experts.

Transforming the Enterprise: AI at Scale with Neo4j

Feed: Neo4j Graph Database Platform.
Author: Jocelyn Hoppa.
Editor’s Note: This presentation was given by Michael Moore and Omar Azhar at GraphConnect New York in September 2018.

Presentation Summary

EY has embarked upon an ambitious graph enterprise AI and machine learning initiative to more effectively uncover fraudulent activities for their customers. They’ve chosen graph technology for this purpose because graphs are simple visual constructs, a natural fit for semantic information, and they speed up the development process for enterprises. The technology is becoming increasingly popular because the cost of computing has fallen precipitously over the last two decades, and graphs provide the perfect scale-up solution for cloud-based storage.

Almost all use cases that benefit from graph technology involve connecting data across multiple domains with processes that rely on relationships and dependencies to uncover patterns. This includes solving the challenging customer 360 use case, which relies on a graph recommendation engine; the B2B use case, which relies on efficient Master Data Management; and the financial use case, which relies on the identification of abnormal patterns.

For financial use cases, EY is working towards applying AI and machine learning to the output from their graph models to assign additional context, inferences or domain knowledge that help companies make better and more efficient decisions. And there’s no better tool than a knowledge graph to achieve these goals.

Full Presentation: Transforming the Enterprise: AI at Scale with Neo4j

What we’re going to be talking about today is how Ernst & Young (EY) is preparing to take advantage of graph enterprise-AI and machine learning at scale:

Michael Moore: We’re going to start with a basic overview of Neo4j, graphs and big data, including why graphs are becoming more and more popular.

Then we’re going to turn to the graph use cases and schemas that we typically see with our clients, followed by a deep dive on what we consider to be the frontier of AI: using graphs to derive additional context, and then leveraging that context to build better models.

We’ll conclude our presentation with a few tips on getting started.

The Growing Popularity of Graphs

We have a massive analytics practice at EY with thousands of consultants deployed in various sectors. We think the graph database is a transformative technology, and that the next generation of analytics will be based on machine reasoning derived from graphs. We also estimate that in about a decade, 50 percent of all SQL workloads will be executed on graphs.

So where do graphs fit in to the database landscape?

On the left side of the above graphic, we have the traditional databases and data warehousing that were designed 20 or 30 years ago when compute power was at a premium.

Other databases in the NoSQL universe include document databases like MongoDB, key-value databases like Redis for limited high-speed querying, and wide-column stores like Apache HBase.

And then you have Neo4j, which is the most enterprise-ready of all of the graph databases, and which can talk to all of the above databases.

Why are graphs becoming so popular? Because over the last 25 years, the cost of computing has been falling precipitously:

This has driven some amazing opportunities. Last year, AWS started offering single machines with four terabytes of memory for a little more than $10 an hour. This allows an entire data fabric to sit in a single environment with incredibly fast performance, especially compared to a network of distributed commodity machines. Microsoft recently released Azure virtual machines with up to 12 TB of memory.

Five years from now we could have cloud-based servers with 100 terabytes of memory, or maybe even a petabyte of memory – and Neo4j provides the perfect scale-up solution.

Even though graph databases only make up about 1.3 percent of the database market share, they are driving the most growth and interest from our community of practitioners:

Ultimately, I think big RAM and big graphs are going to eat up the traditional relational database space. Relational databases still have a place for things like ERP systems, but today, companies are competing on speed and relevance. To be relevant, you have to know a lot about the context, which includes views that cut across a large number of data domains – a function performed excellently by graphs.

What Is a Graph?

Let’s answer the important question: What is a graph? It’s a simple visual construct that has been around for a long time. And graphs are a natural fit for semantic representation that’s easy to develop and understand.

Below is a graph of a single email from the eCommerce ecosystem:

In this graph, which actually represents billion-dollar businesses like eBay, we send emails to customers to drive them to our website and buy our products, which we want to ensure is actually in stock.

We need to be able to answer the following important business question: How good are we at getting people to visit our website and buy our product?

To answer this question, we run analytics and build a graph, which presents the same way in the database as it is drawn above:

This provides the ability to have much richer conversations with your business leaders, and find out: Did we get this idea right? Or did we miss something?

Leaders can look directly at this schema and let us know whether or not we’ve left out an important process or step. The ability to have these conversations quickly and easily speeds up our development.

Traditional databases rely on keys and tables, and you have to be a bit like a bricklayer. To do a query, you have to look at huge data tables, figure out what keys need to be joined, and write the corresponding query. Anyone who has done this kind of work knows you can’t run a query that goes past three or four tables, largely because you consume a huge volume of runtime memory.

Compare this to graph databases, with searches that are a bit more like a snake going through grass. Let’s take a look at the Cypher query in the top right of the above slide.

We want to match all the emails sent to Steve that drove him to visit the website and purchase a product. This single statement replaces an SQL correlated subquery in a very compact way that also represents a traversal path. When the database conducts a search, for every pattern where this path is true, the query will return a row of data. And the great thing about Cypher is that a very compact statement lets you express interesting and sophisticated queries.
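
The Cypher on the slide is not reproduced in this transcript, but a query of that shape, run from Python with the official neo4j driver, might look like the sketch below. The node labels, relationship types, and connection details are illustrative assumptions, not the presenter’s actual schema.

# Sketch: an email -> visit -> purchase traversal with the neo4j Python driver
# (pip install neo4j). Labels, relationship types, and credentials are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = (
    "MATCH (e:Email)-[:SENT_TO]->(c:Customer {name: $name}), "
    "      (c)-[:VISITED]->(s:WebSession)-[:PURCHASED]->(p:Product) "
    "RETURN e.campaign AS campaign, p.sku AS sku"
)

with driver.session() as session:
    for record in session.run(query, name="Steve"):
        print(record["campaign"], record["sku"])

driver.close()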

Graph Database Use Cases

There are a huge number of use cases in the graph space, with more showing up every day:

Data lakes are great at ingesting data, typically unstructured data, and there’s a lot of great work that can be done in this space. They have entire snapshots of huge Oracle databases that maybe haven’t been explored, or they can contain really important conformed and curated data. You can also use Kafka Streams or Kinesis Streams to continually flow this data into that warehouse.

So while data lakes are really great at ingesting data, they’re not good at the syndication, distribution or export of data out to the edges in a common format.

But placing a graph over data lakes provides a common data fabric that allows you to pull up and connect important data, and through the use of Neo4j APIs, you can drive a number of applications. You can also write your own Java for some of the more difficult tasks, which is something we do for a number of our clients.

Neo4j is also fully extensible, and works for use cases across marketing, risk, data governance, sales and marketing analysis and account coverage. In all of these scenarios, you could have a business leader who has been searching for a particular view that their current data structure just doesn’t provide. In your organization that’s a cry for help – and normally points to needing a graph.

The Customer 360 Use Case

Let’s dive into a few examples, starting with the customer 360. Below is an example schema, which has our customer in the middle:

Customer 360 view is our most common use case because it’s very difficult to do.

You have customer segmentation data, marketing touches across channels, product purchasing and hierarchies, support tickets, transactions and tender methods, purchase channels, account stack and history, login information, billing addresses – and the list goes on.

It’s particularly difficult to do this in an enterprise because there are typically a dozen or more applications managing each chunk of this heterogeneous data landscape. A graph is really the only tool that can readily pull all of this data together and rationalize it.

Once you have your customer 360 view in place, we build recommendation engines and perform advanced graph analysis.

Here’s an example of an analytical graph built on top of a customer 360 graph:

Each node is a product node, and the co-purchasing probability for every pair of products has been computed in the graph. For each of our thousands of products, we’ve performed a pairwise calculation for every possible combination. The graph calculation answers the question: With what frequency are two items purchased together?
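
Stripped of the graph machinery, the pairwise computation itself is easy to picture; here is a toy Python version with made-up baskets:

# Toy pairwise co-purchase counter - the same idea the graph computes at scale.
from collections import Counter
from itertools import combinations

orders = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
]

co_purchases = Counter()
for basket in orders:
    for pair in combinations(sorted(basket), 2):   # every unordered product pair
        co_purchases[pair] += 1

for (a, b), n in co_purchases.most_common():
    print(f"{a} + {b}: bought together in {n} of {len(orders)} orders")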

The following demo walks through the customer 360 for an online shopping database:

The below are the typical steps you might follow in a graph recommendation engine:

You build your data graph, your customer 360 graph, and all the frequency associations. Amazon credits 25 to 35 percent of their revenue to their recommendation engine, which only relies on two main relationships: customers who bought Item A also bought Item B, and customers who searched for Item C typically buy Item D. That’s it – two relationships and millions of products that are generating a huge amount of value.

Typically, you go through a set of steps that includes discovery scores and exclusions that eliminate products that are out of stock. You could also potentially boost a product because it’s being marketed, or do some final checking in the post stage to ensure you have good category and diversity coverage.

The B2B Use Case

Enterprise case studies related to the B2B customer 360 or account 360 are quite complicated:

This complexity is in part due to the fact that most businesses are amalgams of businesses that have been acquired, which results in graphs like the above.

For example, we have clients with multiple divisions that each have their own legacy sales teams and multiple Salesforce instances. When you try to do something simple like email a customer, we don’t know which customer or which email address to use because we may have four or five representations of that single customer across our systems. This relates to master data management (MDM) challenges.

In the above slide we have our customer (a business) which has contacts who are people. We explode all of the different identity elements for each one of those contacts and track the source of each identity element. This could be anything from a user-completed form, a third party, or a Salesforce record. Each of these sources will have different levels of authority relative to the construction of a final, or golden, record. You can assign the “probability of authority” as a relationship, dynamically query the graph, and return the most up-to-date golden record possible.
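
A much-simplified Python sketch of that authority-weighted golden record idea follows; the sources, weights, and field names are invented purely for illustration.

# Toy golden-record resolution: for each field, keep the value whose source
# carries the highest authority weight. Sources, weights, and records are made up.
AUTHORITY = {"salesforce": 0.9, "web_form": 0.6, "third_party": 0.3}

records = [
    {"source": "third_party", "email": "s.smith@old-isp.com", "phone": "555-0100"},
    {"source": "web_form",    "email": "steve@example.com"},
    {"source": "salesforce",  "phone": "555-0199"},
]

golden = {}
for rec in records:
    weight = AUTHORITY[rec["source"]]
    for field, value in rec.items():
        if field == "source":
            continue
        best = golden.get(field)
        if best is None or weight > best[0]:
            golden[field] = (weight, value)

print({field: value for field, (weight, value) in golden.items()})
# -> {'email': 'steve@example.com', 'phone': '555-0199'}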

Below is the typical architecture for this type of graph:

At the lower levels you might have a semantic layer where you reconcile precisely the different field names across the different divisions that are contributing to this kind of a graph, and then you bring it up into Neo4j and hook it up to your applications.

The Financial Use Case: Uncovering the Rare and Inefficient Patterns

All our prior examples have been on the topic of common patterns, something graphs are very good at interpreting and recommending. But what about the rare patterns? Graphs are also good at finding fraud, collusion and money laundering through the identification of rare patterns.

This type of graph pattern represents a violation of what you consider to be the canonical subgraph for whatever entity you’re looking at, which is especially helpful in the financial sector.

Consider the following money laundering example:

Let’s start with the green panel. We have a company, which has a bank account, which performs a transaction to send money to a beneficiary account, which is owned by a beneficiary person. All of this looks perfectly legitimate.

But if we move to the red panel and try to link these same elements, we might see that this company has a director who works for another company whose address is very similar to the address the money was just wired to. This uncovers fraud like embezzlement, a very common use case.

Below is a representation of a generalized transaction schema:

I have transactions, and relationships that follow the directional flow of funds, accounts, parties and their associated information. All of these things can be interrogated to understand the potential legitimacy of a transaction.

Below is an example based on some real data we worked with earlier this year:

We modeled transactions as account months, built a subgraph for each of a client’s account months, and received an alert and a suspicious activity report for this particular account. We uncovered multiple transfers of $10,000 for a total of $70,000 in a single month, without any meaningful increase in the account balance. This was clearly an example of what’s known as pass-through money laundering.

The above shows a report for only a single month. When we looked back over additional historical data, we discovered that this pattern of behavior had been consistent for about 16 months:

Even though some alerts were thrown off, none of them were considered significant enough for follow-up. But in the graph, it became clear that this is a sustained pattern of money laundering.

Omar: We’re going to continue with this money laundering example, but from a more contextual perspective. Let’s start with a bit of background in the financial services sector:

From a data science and machine learning perspective, we generally see that a lot of financial use cases are driven by a few core capabilities:

    • Transcription and information extraction, which converts all unstructured data into machine-readable formats, whether it’s call centers, calls, texts or onboarding documentation
    • Natural language processing
    • Knowledge graphs, which let you put your data in a graph topology and apply machine reasoning and inference – new capabilities that enable a lot of machine learning use cases. You’ll see how we do this for money laundering.
    • Fundamental machine learning and deep learning. We separate the two because we often see that a lot of general business use cases can be solved with good data science and basic machine learning.

The enabler for all of these competencies are big data platforms, particularly Neo4j.

The Third Wave of AI: Context and the Knowledge Graph

For more context in the financial money laundering space, we can look towards the Defense Advanced Research Projects Agency (DARPA):

The first wave of AI covers rules and decision trees – the if/then statements used for tools such as consumer tax software. The second wave of AI is what we see in many current AI tools, which is statistical-based learning for pattern matching. You throw in a lot of data, it understands the distribution of the features, cuts across it and then gives you an output. This is used for tools like facial recognition software or any classification-type problem.

The third wave that DARPA is focusing on is contextual adaptation. This takes the output from your second wave models and embeds additional context, inferences or domain knowledge, and then gives that information to the decision makers to actually make the decisions.

This means that it’s no longer enough to build a statistical model, get an output and make decisions based only on that. You also need to build a system or design your data in such a way that it can provide the reason for why those decisions or statistical outputs are correct, or can provide the context to do so.

And if we were to define the third wave of AI as requiring context, there’s no better data model than a knowledge graph to provide that context.

These types of graphs let you connect all your data, concepts, ideas, entities and ontologies into one very densely connected structure. Those relationships are essentially what drives contextual learning and most of the context for you to make those decisions. The more connections and relationships you make, the stronger your knowledge graphs – and the more dense the information you have at your fingertips.

Let’s walk through an example using a large bank with multiple lines of business, credit cards, mortgages and loans, with a focus on only one.

Each line of business has a variety of different operations, and this particular customer has a retail deposit account with my bank. I’m making decisions on that customer based on information only from their retail account, which provides relatively little context because this person is likely making a lot of other financial actions separate from this deposit account.

To provide more context, I can start connecting and building a classic customer 360 view using a graph database. This allows us to connect all the data we have on that customer across all the products my institution offers:

In this example, I know the customer also has a credit card and loans with my financial institution. This allows me to make decisions with a lot more context.

You can take this a little further by comparing this customer to another customer in your bank with similar types of products:

In this case, the customers are co-signers on a mortgage, which allows you to infer an additional relationship between these two customers: they likely live together. And by inferring this relationship, I’ve created an additional context in my graph. Instead of looking at each customer individually, I can start making decisions about these customers as a household. With more relationships, I have more context for my decision-making.

Using Neo4j as the knowledge graph database helps create the data model that surfaces this context all at once, which helps you make better-informed decisions.

Below is what our model looks like:

In this customer 360 use case, the graph database runs across your multiple data silos to surface all that data up for your various different decision-making or application layers. With all data in this single graph, there’s an entirely new slew of mathematics, topology and machine reasoning concepts that are now available.

New Context with Implied Relationships

This brings us to the difference between explicit and implicit knowledge. Explicit knowledge refers to the data that already exists in our database. Having all your data in one place is the first milestone, and one that many businesses haven’t reached yet:

Once you reach this milestone, the next step from an AI perspective is performing machine reasoning and inference. This includes using reasoning to add knowledge that doesn’t currently exist in your data sources:

In this diagram, red goes to purple, purple goes to green, and green goes to yellow. Therefore we can infer that there may be some sort of relationship between red and yellow.

So what does this mean in a real-world use case? If I have person A and they’re transacting with person B, I can infer that there’s likely a relationship between person A and person B. I can then put this into my graph:

I can take that even further by applying different information theory and network science methods to calculate the numeric probability that if person A transacts with person B and person B transacts with person C, person A and person C have a relationship.
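
A minimal sketch of that chained inference, with invented confidence scores:

# Toy transitive inference: if A transacts with B and B with C, estimate an
# implied A-C relationship as the product of the two edge probabilities.
direct = {
    ("A", "B"): 0.8,   # observed transaction relationships with confidence scores
    ("B", "C"): 0.5,
}

inferred = {}
for (x, y1), p1 in direct.items():
    for (y2, z), p2 in direct.items():
        if y1 == y2 and x != z and (x, z) not in direct:
            inferred[(x, z)] = max(inferred.get((x, z), 0.0), p1 * p2)

print(inferred)   # {('A', 'C'): 0.4}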

You can start with very simple business-type logic for connecting your customers or your products together in this way, and can continue building on this through a new world of mathematics that’s becoming more readily available for machine learning and AI applications.

Now let’s go through a quick overview of some of the pain points in our anti-money laundering cases.

It’s a very rules-based process, so there’s a high number of alerts and thresholds. For example, if transactions over $10,000 create a flag, you generate a very high false positive volume – so high that it’s not possible for an analyst to go through each alert. Additionally, this activity exists in a silo of one transaction when in fact we’re dealing with a very complex, nefarious human network with continually evolving behavior that doesn’t lend itself well to a rule-based model. This is a hard problem to tackle in the current process.

We’re working to augment the current process by using Neo4j, big data tools like Spark for graph computation, GraphX for graph theory calculations, and TensorFlow for machine learning.

The goal is to provide analysts with additional context through the ability to view the transaction networks for any flagged transactions, and to create additional metrics to outline common, non-fraudulent transactions to bring down the volume of false positives. Eventually we can use graph analytics plus AI to build a deep learning model that can help detect structures that would uncover behavior that normally goes undetected.

Network Optimization

We’ve been optimizing networks and information flow for decades, so this isn’t really a new type of thinking. Below is our classic case of transportation hubs – Denver and Los Angeles – along with their regional airports:

We can take lessons learned from other industries and apply these network optimization principles to our money laundering use case.

Any person or company who is moving money around tries to do so in the most efficient way possible because it keeps costs lower. This rule applies whether you’re an individual trying to get a mortgage or a financial institution. We can take examples from these different industries to understand what an anomalous network looks like in the context of efficient vs. inefficient.

Let’s go back to our money laundering use case with the following graph, which is an actual structure from real-world banking assets:

Accounts in the blue nodes have created triggers (green nodes), which represent accounts that never went through. With this current structure, the analysis only looks at an individual account rather than the three connected accounts that triggered an alert.

Using graph databases, we can infer those relationships and surface that context to the analyst. This will alert them that this one account is linked to two other accounts that are just a single hop away, which could indicate something nefarious. By inferring that these triggered accounts are related, we can create mathematical and business-based rules around that to bring that up.

Let’s take that even farther to other types of money transaction networks that we frequently see. In this example we have a large number of green account nodes that have been flagged, which warrant a closer look:

But when we explore further, we see that these each represent an optimal money flow, and more likely represent something like the federal banking system. There’s a single central green bank with money dispersed outwards. We can apply mathematical calculations to identify this as a normal random dispersion of money flow.

Let’s take a look at another money transfer example:

Some transactions follow a typical dispersion money model, but the red outlines point to some interconnected hubs that warrant closer inspection.

By using knowledge graphs with Neo4j along with tools like Spark, GraphX and APOC procedures, we can develop metrics to measure how normal/efficient or abnormal/inefficient our transaction network is, and infer relationships to uncover rings. The next step is to create machine learning and data science models to create metrics that bring down false positives at scale.

What’s Next: Machine Learning and AI at Scale

Below is an overview of all the features we’d like to implement at scale, using Neo4j as our underlying big data network:

This includes current AML processes and graph analytics such as betweenness and centrality measures through Spark and APOC procedures. We can also bring in features that do things like a “negative news” scrape, which would import information related to negative news coverage of an individual.
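
At production scale this runs through Spark and APOC as described, but the same measures are easy to prototype on a small extract with the networkx library; here is a hedged sketch over a made-up edge list:

# Prototype of betweenness and degree centrality on a tiny transaction network.
# Requires networkx (pip install networkx); the edge list is invented.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("acct1", "acct2"), ("acct2", "acct3"), ("acct3", "acct4"),
    ("acct2", "acct5"), ("acct5", "acct4"),
])

betweenness = nx.betweenness_centrality(G)
degree = nx.degree_centrality(G)

for node in G.nodes:
    print(f"{node}: betweenness={betweenness[node]:.2f} degree={degree[node]:.2f}")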

The next goal is to dive into the new field of Graph Convolutional Neural Networks (GCNNs), an experimental field related to deep learning specifically just for graph constructs.

We’re exploring this to develop a deep learning or AI-based model that can help us better understand money laundering. Our goal is to train an AI model to identify a money-laundering structure rather than identifying normal vs. abnormal structures of financial flows.

This same line of thinking can also be applied to customer to customer networks and product to product networks as well.

Ingesting your Data

Michael: Now let’s walk through a couple slides on how to do this in practice with large data.

To get data out of data lakes, you typically build large graph-form tables of nodes and mappings that would be comprised of hundreds of millions of data rows. You can zip those up, and then upload them into Neo4j through the database’s efficient high-speed loader:

I built a graph a couple years ago with half a billion nodes, 2.2 billion relationships, and nine billion properties in about an hour and a half. This is the typical performance, and it’s getting even faster.

There are also several ways to make use of advanced analytics:

You can take data out of Neo4j using whichever tool you like, and then write the results back to your graph, which allows your graph to learn over time. You can productionalize your analytics far more easily than if you have to go through all the steps of doing transformations, creating a model to untransform your data, and then writing it back to a SQL database.

And finally, this is how these environments can look:

You can use the cloud-based Sandbox to get going, which you’d place on top of your Hadoop in a big server that’s at least a couple hundred gigs. Then you can load up some of your favorite tools like AWS and Azure.

If you have sensitive data, you might consider using an air-gapped solution:

This is a NVIDIA DGX AI workstation, which is what we have back in Seattle. It’s not connected to the internet, and we do data and graph modeling within this system.

How to Identify Whether or Not You Have a Graph Problem

In summary, here are some questions that let you know you’re having a graph problem: How can I get a better understanding of my customers in order to create more relevant experiences? How can I effectively mobilize and syndicate the data I’m ingesting? How can I get more business value and deeper insights from the data that I already have? And what’s the next best action I can take?

This brings us to the entire purpose of graphs: to enable you to very quickly choose what is the next best action for your company.


MySQL Connector/Python 8.0.16 has been released

Feed: Planet MySQL
Author: InsideMySQL.com

Dear MySQL users,

MySQL Connector/Python 8.0.16 is the latest GA release version of the
MySQL Connector/Python 8.0 series. The X DevAPI enables application
developers to write code that combines the strengths of the relational
and document models using a modern, NoSQL-like syntax that does not
assume previous experience writing traditional SQL.

To learn more about how to write applications using the X DevAPI, see
http://dev.mysql.com/doc/x-devapi-userguide/en/. For more information
about how the X DevAPI is implemented in MySQL Connector/Python, and its
usage, see http://dev.mysql.com/doc/dev/connector-python.

Please note that the X DevAPI requires at least MySQL Server version 8.0
or higher with the X Plugin enabled. For general documentation about how
to get started using MySQL as a document store, see
http://dev.mysql.com/doc/refman/8.0/en/document-store.html.
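
As a quick orientation (not part of the announcement text), a minimal Document Store round trip with Connector/Python’s X DevAPI looks roughly like the sketch below; the connection settings, schema, and collection name are placeholders.

# Minimal X DevAPI sketch with MySQL Connector/Python 8.0
# (pip install mysql-connector-python). Requires MySQL Server 8.0 with the
# X Plugin listening on its default port, 33060. All names are placeholders.
import mysqlx

session = mysqlx.get_session({"host": "localhost", "port": 33060,
                              "user": "app", "password": "secret"})
schema = session.get_schema("test")

collection = schema.create_collection("episodes")
collection.add({"title": "Pilot", "rating": 17}).execute()

result = collection.find("rating > :minimum").bind("minimum", 10).execute()
for doc in result.fetch_all():
    print(doc["title"], doc["rating"])

session.close()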

To download MySQL Connector/Python 8.0.16, see the “General Available
(GA) releases” tab at http://dev.mysql.com/downloads/connector/python/

Enjoy!

Enjoy and thanks for the support!

On Behalf of Oracle/MySQL Release Engineering Team,
Balasubramanian Kandasamy

MySQL Connector/Node.js 8.0.16 has been released

Feed: Planet MySQL
Author: InsideMySQL.com

Dear MySQL users,

MySQL Connector/Node.js is a new Node.js driver for use with the X
DevAPI. This release, v8.0.16, is a maintenance release of the
MySQL Connector/Node.js 8.0 series.

The X DevAPI enables application developers to write code that combines
the strengths of the relational and document models using a modern,
NoSQL-like syntax that does not assume previous experience writing
traditional SQL.

MySQL Connector/Node.js can be downloaded through npm (see
https://www.npmjs.com/package/@mysql/xdevapi for details) or from
https://dev.mysql.com/downloads/connector/nodejs/.

To learn more about how to write applications using the X DevAPI, see
http://dev.mysql.com/doc/x-devapi-userguide/en/. For more information
about how the X DevAPI is implemented in MySQL Connector/Node.js, and
its usage, see http://dev.mysql.com/doc/dev/connector-nodejs/.

Please note that the X DevAPI requires at least MySQL Server version
8.0 or higher with the X Plugin enabled. For general documentation
about how to get started using MySQL as a document store, see
http://dev.mysql.com/doc/refman/8.0/en/document-store.html.

Changes in MySQL Connector/Node.js 8.0.16 (2019-04-25, General
Availability)

X DevAPI Notes

* Connector/Node.js now supports connection attributes as
key-value pairs that application programs can pass to the
server. Connector/Node.js defines a default set of
attributes, which can be disabled or enabled. In addition
to these default attributes, applications can also
provide their own set of custom attributes.

+ Specify connection attributes as a
connection-attributes parameter in a connection
string, or by using the connectionAttributes
property using either a plain JavaScript object or
JSON notation to specify the connection
configuration options.
The connection-attributes parameter value must be
either empty (the same as specifying true), a
Boolean value (true or false to enable or disable
the default attribute set), or a list of zero or
more key=value pair specifiers separated by commas
(to be sent in addition to the default attribute
set). Within a list, a missing key value evaluates
as NULL.
The connectionAttributes property allows passing
user-defined attributes to the application using
either a plain JavaScript object or JSON notation to
specify the connection configuration options. Define
each attribute in a nested object under
connectionAttributes where the property names
matches the attribute names, and the property values
match the attribute values. Unlike
connection-attributes, and while using plain
JavaScript objects or JSON notation, if the
connectionAttributes object contains duplicate keys
then no error is thrown and the last value specified
for a duplicate object key is chosen as the
effective attribute value.
Examples:
Not sending the default client-defined attributes:
mysqlx.getSession('{ "user": "root", "connectionAttributes": false }')

mysqlx.getSession('mysqlx://root@localhost?connection-attributes=false')

mysqlx.getSession({ user: 'root', connectionAttributes: { foo: 'bar', baz: 'qux', quux: '' } })

mysqlx.getSession('mysqlx://root@localhost?connection-attributes=[foo=bar,baz=qux,quux]')

Application-defined attribute names cannot begin with _
because such names are reserved for internal attributes.
If connection attributes are not specified in a valid
way, an error occurs and the connection attempt fails.
For general information about connection attributes, see
Performance Schema Connection Attribute Tables
(http://dev.mysql.com/doc/refman/8.0/en/performance-schema-connection-attribute-tables.html).

Functionality Added or Changed

* Optimized the reuse of existing connections through
client.getSession() by only re-authenticating if
required.

* For X DevAPI, performance for statements that are
executed repeatedly (two or more times) is improved by
using server-side prepared statements for the second and
subsequent executions. This happens internally;
applications need take no action and API behavior should
be the same as previously. For statements that change,
repreparation occurs as needed. Providing different data
values or different offset() or limit() values does not
count as a change. Instead, the new values are passed to
a new invocation of the previously prepared statement.

Bugs Fixed

* Idle pooled connections to MySQL Server were not reused,
and instead new connections had to be recreated. (Bug
#29436892)

* Executing client.close() would not close all associated
connections in the connection pool. (Bug #29428477)

* connectTimeout instead of maxIdleTime determined whether
idle connections in the connection pool were reused
rather than creating new connections. (Bug #29427271)

* Released connections from the connection pool were not
being reset and reused; instead new connections were
being made. (Bug #29392088)

* Date values in documents were converted to empty objects
when inserted into a collection. (Bug #29179767, Bug
#93839)

* A queueTimeout value other than 0 (infinite) prevented
the acquisition of old released connections from the
connection pool. (Bug #29179372, Bug #93841)

On Behalf of MySQL/ORACLE RE Team
Gipson Pulla

MySQL Connector/NET 8.0.16 has been released

Feed: Planet MySQL
Author: InsideMySQL.com

Dear MySQL users,

MySQL Connector/NET 8.0.16 is the fourth version to support
Entity Framework Core 2.1 and the sixth general availability release
of MySQL Connector/NET to add support for the new X DevAPI, which
enables application developers to write code that combines the
strengths of the relational and document models using a modern,
NoSQL-like syntax that does not assume previous experience writing traditional SQL.

To learn more about how to write applications using the X DevAPI, see
http://dev.mysql.com/doc/x-devapi-userguide/en/index.html. For more
information about how the X DevAPI is implemented in Connector/NET, see
http://dev.mysql.com/doc/dev/connector-net.

NuGet packages provide functionality at a project level. To get the
full set of features available in Connector/NET such as availability
in the GAC, integration with Visual Studio’s Entity Framework Designer
and integration with MySQL for Visual Studio, installation through
the MySQL Installer or the stand-alone MSI is required.

Please note that the X DevAPI requires at least MySQL Server version
8.0 or higher with the X Plugin enabled. For general documentation
about how to get started using MySQL as a document store, see
http://dev.mysql.com/doc/refman/8.0/en/document-store.html.

To download MySQL Connector/NET 8.0.16, see
http://dev.mysql.com/downloads/connector/net/

Installation instructions can be found at
https://dev.mysql.com/doc/connector-net/en/connector-net-installation.html

Changes in MySQL Connector/NET 8.0.16 (2019-04-25, General Availability)

* Functionality Added or Changed

* Bugs Fixed

Functionality Added or Changed

* Document Store: Support was added for the -> operator to
be used with JSON document paths in relational
statements. For example:
table.Select().Where("additionalinfo->$.hobbies = 'Reading'");

(Bug #29347028)

* Document Store: The performance for statements that are
executed repeatedly (two or more times) is improved by
using server-side prepared statements for the second and
subsequent executions. This happens internally;
applications need take no action and API behavior should
be the same as previously. For statements that change,
repreparation occurs as needed. Providing different data
values or different OFFSET or LIMIT clause values does
not count as a change. Instead, the new values are passed
to a new invocation of the previously prepared statement.

* Document Store: Connector/NET now supports the ability to
send connection attributes (key-value pairs that
application programs can pass to the server at connect
time). Connector/NET defines a default set of attributes,
which can be disabled or enabled. In addition,
applications can specify attributes to be passed together
with the default attributes. The default behavior is to
send the default attribute set.
The aggregate size of connection attribute data sent by a
client is limited by the value of the
performance_schema_session_connect_attrs_size server
variable. The total size of the data package should be
less than the value of the server variable. For X DevAPI
applications, specify connection attributes as a
connection-attributes parameter in a connection string.
For usage information, see Options for X Protocol Only
(http://dev.mysql.com/doc/connector-net/en/connector-net-8-0-connection-options.html#connector-net-8-0-connection-options-xprotocol).
For general information about connection attributes, see
Performance Schema Connection Attribute Tables
(http://dev.mysql.com/doc/refman/8.0/en/performance-schema-connection-attribute-tables.html).

* Document Store: Connector/NET now has improved support
for resetting sessions in connection pools. Returning a
session to the pool drops session-related objects such as
temporary tables, session variables, and transactions,
but the connection remains open and authenticated so that
reauthentication is not required when the session is
reused.

* Connector/NET applications now can use certificates in
PEM format to validate SSL connections in addition to the
native PFX format (see Tutorial: Using SSL with Connector/NET
(http://dev.mysql.com/doc/connector-net/en/connector-net-tutorials-ssl.html)).
PEM support applies to both classic MySQL protocol
and X Protocol connections.
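
To make the connection-attributes and PEM items above a little more
concrete, here are two hedged illustrations. The host names, attribute
names, and file paths are invented for the example, and the SSL option
spellings reflect our reading of the Connector/NET connection options
rather than text copied from the documentation. An X Protocol URI that
passes two custom attributes alongside the default set
(connection-attributes=false would suppress attributes entirely) might
look like:

  mysqlx://user:password@localhost:33060?connection-attributes=[app=inventory,team=ops]

and a classic protocol connection string validating the server against
PEM files might look like:

  server=myserver;user=myuser;password=*****;sslmode=VerifyCA;sslca=ca.pem;sslcert=client-cert.pem;sslkey=client-key.pem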

Bugs Fixed

* Document Store: All methods able to execute a statement
were unable to execute the same statement a second time.
Now, the values and binding parameters remain available
after the method is executed and string parameters are no
longer converted to numbers. Both changes enable a
follow-on execution to reuse the previous parameters.
(Bug #29249857, Bug #29304767)

* An exception was generated when the MySqlDbType
enumeration was given an explicit value and then passed
as a parameter to the MySqlCommand.Prepare method. (Bug
#28834253, Bug #92912)

* Validation was added to ensure that when a column of type
TIME has the value 00:00:00, the value is stored rather than
being converted to NULL. (Bug #28383726, Bug #91752)

On Behalf of MySQL Release Engineering Team,
Surabhi Bhat


MySQL 8.0.16: how to validate JSON values in NoSQL with check constraint

$
0
0

Feed: Planet MySQL
;
Author: Frederic Descamps
;

As you may have noticed, MySQL 8.0.16 has been released today!

One of the major long-expected features is support for CHECK constraints.

My colleague, Dave Stokes, already posted an article explaining how this works.

In this post, I wanted to show how we could take advantage of this new feature to validate JSON values.

Let’s take the following example:
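
A minimal sketch of what that collection could look like when viewed from SQL. The collection name, attribute names, and values are purely illustrative, and the table below is a simplified stand-in for what createCollection() actually generates:

  CREATE TABLE rates (
    _id VARCHAR(32) PRIMARY KEY,
    doc JSON
  );

  INSERT INTO rates VALUES
    ('1', '{"user": "fred", "episode": "S01E01", "rating": 8}'),
    ('2', '{"user": "fred", "episode": "S01E02", "rating": 25}');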

So we have a collection of documents representing ratings from a user on some episodes. Now, I expect the value of the rating to be between 0 and 20.

Currently I could enter whatever value, even characters…

To avoid characters, I can already create a virtual column typed as an integer:
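
Continuing with the illustrative table from the sketch above:

  ALTER TABLE rates
    ADD COLUMN rating INT
    GENERATED ALWAYS AS (doc->>'$.rating') VIRTUAL;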

So now, only integer values for rating should be allowed:

Perfect, but can I enter any integer value?

In fact, yes, of course! And that’s where the new CHECK constraints come into action!

We first need to modify the current document whose rating attribute has a value that won’t be valid under the new constraint.
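
Continuing the sketch with the illustrative names from above, we first bring the out-of-range document back within bounds and then add the constraint:

  UPDATE rates
    SET doc = JSON_SET(doc, '$.rating', 20)
    WHERE rating > 20;

  ALTER TABLE rates
    ADD CONSTRAINT rating_range CHECK (rating BETWEEN 0 AND 20);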

And now we can test again:
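
Still using the illustrative names, a write that pushes the rating out of range is now rejected with a check constraint violation:

  UPDATE rates
    SET doc = JSON_SET(doc, '$.rating', 42)
    WHERE _id = '1';
  -- ERROR 3819 (HY000): Check constraint 'rating_range' is violated.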

Woohooo! A nice feature that also benefits the MySQL Document Store!

For the curious who want to see what the table looks like in its SQL definition:
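
(Using the illustrative table name from the sketch above; \G is simply the mysql client’s vertical output terminator.)

  SHOW CREATE TABLE rates\G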

Enjoy NoSQL with MySQL 8.0 Document Store #MySQL8isGreat.

MemSQL Offers Streaming Systems Download from O’Reilly

$
0
0

Feed: MemSQL Blog.
Author: Floyd Smith.

More and more, MemSQL is used to help add streaming characteristics to existing systems, and to build new systems that feature streaming data from end to end. Our new ebook excerpt from O’Reilly introduces the basics of streaming systems. You can then read on – in the full ebook and here on the MemSQL blog – to learn about how you can make streaming part of all your projects, existing and new.

Streaming has been largely defined by three technologies – one that’s old, one that’s newer, and one that’s out-and-out new. Streaming Systems covers the waterfront thoroughly.

Originally, Tyler Akidau, one of the book’s authors, wrote two very popular blog posts: Streaming 101: The World Beyond Batch, and Streaming 102, both on the O’Reilly site. The popularity of the blog posts led to the popular O’Reilly book, Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing.

In the excerpt that we offer here, you will see a solid definition of streaming and how it works with different kinds of data. The authors address the role of streaming in the entire data processing lifecycle with admirable thoroughness.

They also describe the major concerns you’ll face when working with streaming data. One of these is the difference between the order in which data is received and the order in which processing on it is completed. Reducing such disparities as much as possible is a major topic in streaming systems.

In this scatter plot from the Streaming Systems excerpt, the authors plot event arrival time vs. processing completion time.

In both the excerpt, and the full ebook, the authors also tackle three key streaming technologies that continue to play key roles in the evolution of MemSQL: Apache Kafka, Apache Spark, and – perhaps surprisingly, in this context – SQL.

Apache Kafka and Streaming

Apache Kafka saw its 1.0 version introduced by Confluent in late 2017. (See Apache Kafka 1.0 Introduced Exactly Once at The New Stack.) MemSQL works extremely well with Kafka. Both Kafka and MemSQL are unusual in supporting exactly-once updates, a key feature that not only adds valuable capabilities, but affects how you think about data movement within your organization.

It’s very easy to connect Kafka streams to MemSQL Pipelines for rapid ingest. And MemSQL’s Pipelines to stored procedures feature lets you handle complex transformations without interfering with the streaming process.
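
As a rough sketch of what that can look like in practice (the broker address, topic, table, and column names here are assumptions for the example, not taken from a real deployment), a pipeline subscribes to a Kafka topic and loads each batch straight into a table:

  CREATE TABLE clicks (
    user_id BIGINT,
    url VARCHAR(2048),
    ts DATETIME
  );

  -- Attach the pipeline to the 'clicks' topic on the given broker,
  -- land the rows directly in the table, then start it.
  CREATE PIPELINE clicks_from_kafka
    AS LOAD DATA KAFKA 'kafka-broker:9092/clicks'
    INTO TABLE clicks
    FIELDS TERMINATED BY ',';

  START PIPELINE clicks_from_kafka;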

In this bar chart from the full Streaming Systems ebook, Kafka and Spark both appear as relatively recent streaming arrivals.

Apache Spark and Streaming

Apache Spark is an older streaming solution, initially released in 2014. (One of the key components included in the 1.0 release was Spark SQL, for ingesting structured data into Spark.) Spark was first developed to address concerns with Google’s MapReduce data processing approach. While widely used, Spark is perhaps as well known today for its machine learning and AI capabilities as for its core streaming functionality.

MemSQL first introduced the MemSQL Spark Connector in 2015, then included full Spark support in MemSQL Pipelines and Pipelines to stored procedures. Today, Spark and MemSQL work very well together. MemSQL customer Teespring used Kafka and Spark together for machine learning implementations.

The MemSQL case study for Teespring shows a reference architecture with data streaming in from S3, Redshift, Kafka, and Spark to real-time analytics, with Kafka and Spark used together for machine learning.

SQL and Streaming

Ironically, one of the foundational data technologies, SQL, plays a big role in Streaming Systems, and in the future of streaming. SQL is all over the full book’s Table of Contents:

  • Streaming SQL is Chapter 8 of the full ebook. In this chapter, the authors discuss how to use SQL robustly in a streaming environment.
  • Streaming Joins is Chapter 9. Joins are foundational to analytics, and optimizing them has been the topic of decades of work in the SQL community. Yet joins are often neglected in the NoSQL movement that is most closely associated with streaming. Streaming Systems shows how to use joins in a streaming environment.

MemSQL is, of course, a leading database in the NewSQL movement. NewSQL databases combine the best of traditional relational databases – transactions, structured data, and SQL support – with the best of NoSQL: scalability, speed, and flexibility.

Saeed Barghi of MemSQL partner Zoomdata shows Kafka and MemSQL used together for business intelligence.

MemSQL partner Zoomdata uses a reference architecture with data streaming from Confluent Kafka into MemSQL through a Pipeline for real-time data visualization.

Next Steps to Streaming

We recommend that you download and read our book excerpt from Streaming Systems today. If you find it especially valuable, consider getting the full ebook from O’Reilly.

If you wish to move to implementation, you can start with MemSQL today for free. Or, contact us to speak with a technical professional who can describe how MemSQL can help you achieve your goals.

The Need for Operational Analytics

$
0
0

Feed: MemSQL Blog.
Author: Rick Negrin.

The proliferation of streaming analytics and instant decisions to power dynamic applications, as well as the rise of predictive analytics, machine learning, and operationalized artificial intelligence, have introduced a requirement for a new type of database workload: operational analytics.

The two worlds of transactions and analytics, set apart from each other, are a relic of a time before data became an organization’s most valuable asset. Operational analytics is a new set of database requirements and system demands that are integral to achieving competitive advantage for the modern enterprise.

This new approach was called for by Gartner as a Top 10 technology for 2019, under the name “continuous analytics.” Delivering operational analytics at scale is the key to real-time dashboards, predictive analytics, machine learning, and enhanced customer experiences which differentiate digital transformation leaders from the followers.

However, companies are struggling to build these new solutions because existing legacy database architectures cannot meet the demands placed on them. The existing data infrastructure cannot scale to the load put on it, and it doesn’t natively handle all the new sources of data.

The separation of technologies between the transactional and analytic technologies results in hard tradeoffs that leave solutions lacking in operational capability, analytics performance, or both. There have been many attempts in the NoSQL space to bridge the gap, but all have fallen short of meeting the needs of this new workload.

Operational analytics enables businesses to leverage data to enhance productivity, expand customer and partner engagement, and support orders of magnitude more simultaneous users. But these requirements demand a new breed of database software that goes beyond the legacy architecture.

The industry calls these systems by several names: hybrid transaction and analytics processing (HTAP) from Gartner; hybrid operational/analytics processing (HOAP) from 451 Research; and translytical from Forrester.

Consulting firms typically use the term we have chosen here, operational analytics, and CapGemini has even established a full operational analytics consultancy practice around it.

The Emergence of Operational Analytics

Operational Analytics has emerged alongside the existing workloads of Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). I outlined the requirements of those workloads in this previous blog entry.

To summarize, OLTP requires data lookups, transactionality, availability, reliability, and scalability. OLAP, by contrast, requires support for running complex queries very fast over large data sets, along with batch ingest of large amounts of data. The OLTP- and OLAP-based systems served us well for a long time. But over the last few years things have changed.

Decisions should not have to wait for the data

It is no longer acceptable to wait for the next quarter, or week, or even day to get the data needed to make a business decision. Companies are increasingly online all the time; “down for maintenance” and “business hours” are quickly becoming a thing of the past. Companies that have a streaming real-time data flow have a significant edge over their competitors. Existing legacy analytics systems were simply not designed to work like this.

Companies must become insight driven

This means that, instead of a handful of analysts querying the data, you have hundreds or thousands of employees hammering your analytics systems every day in order to make informed decisions about the business. In addition, there will be automated systems – ML/AI and others – also running queries to get the current state of the world to feed their algorithms. The existing legacy analytics systems were simply not designed for this kind of usage.

Companies must act on insights to improve customer experience

Companies want to expose their data to their customers and partners. This improves the customer experience and potentially adds net new capabilities. For example, a cable company tracks users as they try to set up their cable modems so they can proactively reach out if they see there is a problem. This requires a system that can analyze and react in real-time.

Another example is an electronics company that sells smart TVs and wants to expose which shows customers are watching to its advertisers. This dramatically increases the number of users trying to access your analytics systems.

In addition, the expectations of availability and reliability are much higher for customers and partners. So you need a system that can deliver an operational service level agreement (SLA). Since your partners don’t work in your company, it means you are exposing the content outside the corporate firewall, so strong security is a must. The existing legacy analytics systems were simply not designed for this kind of usage.

Data is coming from many new sources and in many types and formats

The amount of data being collected is growing tremendously. Not only is it being collected from operational systems within the company; data is also coming from edge devices. The explosion of IoT devices, such as oil drills, smart meters, household appliances, and factory machinery, is the key contributor to this growth.

All this data needs to be fed into the analytics system. This leads to increased complexity in the types of data sources (such as Kafka and Spark), in data types and formats (geospatial, JSON, Avro, Parquet, raw text, etc.), and in the throughput requirements for ingesting the data. Again, the existing legacy analytics systems were simply not designed for this kind of usage.

The Rise of Operational Analytics

These changes have given rise to a new database workload, operational analytics. The short description of operational analytics is an analytical workload that needs an operational SLA. Now let’s unpack what that looks like.

Operational Analytics as a Database Workload

Operational analytics primarily describes analytical workloads, so the query shapes and complexity are similar to OLAP queries. In addition, the data sets are just as large as in OLAP, although it is often the most recent data that matters most. (This is usually a fraction of the total data set.) Data loading is similar to OLAP workloads, in that data comes from an external source and is loaded independently of the applications or dashboards that are running the queries.

But this is where the differences end. Operational analytics has several characteristics that set it apart from pure OLAP workloads. Specifically, the speed of data ingestion, scaling for concurrency, availability and reliability, and speed of query response.

Operational analytics workloads require an SLA on how fast the data needs to be available. Sometimes this is measured in seconds or minutes, which means the data infrastructure must allow streaming the data in constantly, while still allowing queries to be run.

Sometimes this means there’s a window of time (usually a single-digit number of hours) during which all the data must be ingested. As data sets grow, the existing data warehouse (DW) technologies have had trouble loading the data within the time window (and certainly don’t allow streaming). Data engineers often have to do complex tricks to continue meeting data loading SLAs with existing DW technologies.

Data also has to be loaded from a larger set of data sources than in the past. It used to be that data was batch-loaded from an operational system during non-business hours. Now data comes in from many different systems.

In addition, data can flow from various IoT devices far afield from the company data center. The data gets routed through various types of technologies (in-memory queues like Kafka, processing engines like Spark, etc.). Operational analytics workloads need to easily handle ingesting from these disparate data sources.

Operational analytics workloads also need to scale to large numbers of concurrent queries. With the drive towards being data driven and exposing data to customers and partners, the number of concurrent users (and therefore queries) in the system has increased dramatically. In an OLAP workload, five to ten queries at a time was the norm. Operational analytics workloads often must be able to handle high tens, hundreds, or even thousands of concurrent queries.

As in an OLTP workload, availability and reliability are also key requirements. Because these systems are now exposed to customers or partners, the SLA required is a lot stricter than for internal employees.

Customers expect a 99.9% or better uptime and they expect the system to behave reliably. They are also less tolerant of planned maintenance windows. So the data infrastructure backing these systems needs to have support for high availability, with the ability to handle hardware and other types of failure.

Maintenance operations (such as upgrading the system software or rebalancing data) need to become transparent, online operations that are not noticeable to the users of the system. In addition, the system should self-heal when a problem occurs, rather than waiting for an operator to get alerted to an issue and respond.

Strong durability is important as well. This is because even though data that is lost could be reloaded, the reloading may cause the system to break the availability SLA.

The ability to retrieve the data you are looking for very quickly is the hallmark feature of database systems. Getting access to the right data quickly is a huge competitive advantage. Whether it is internal users trying to get insights into the business, or you are presenting analytics results to a customer, the expectation is that the data they need is available instantly.

The speed of the query needs to be maintained regardless of the load on the system. It doesn’t matter if there is a peak number of users online, the data size has expanded, or there are failures in the system. Customers expect you to meet their expectations on every query with no excuses.

This requires a solid distributed query processor that can pick the right plan to answer any query and get it right every time. It means the algorithms used must scale smoothly with the system as it grows in every dimension.

Supporting Operational Analytics Use Cases with MemSQL

MemSQL was built to address these requirements in a single converged system. MemSQL is a distributed relational database that supports ANSI SQL. It has a shared-nothing, scale-out architecture that runs well on industry standard hardware.

This allows MemSQL to scale in a linear fashion simply by adding machines to a cluster. MemSQL supports all the analytical SQL language features you would find in a standard OLAP system, such as joins, group by, aggregates, etc.

It has its own extensibility mechanism so you can add stored procedures and functions to meet your application requirements. MemSQL also supports the key features of an OLTP system: transactions, high availability, self-healing, online operations, and robust security.

It has two storage subsystems: an on-disk column store that gives you the advantage of compression and extremely fast aggregate queries, as well as an in-memory row store that supports fast point queries, aggregates, indices, and more. The two table types can be mixed in one database to get the optimal design for your workload.
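
As a minimal sketch of mixing the two table types in one database (the schema and names are illustrative), a rowstore table can hold the hot, frequently updated rows while a clustered columnstore table holds the compressed history used for scans and aggregations:

  -- In-memory rowstore (the default table type): fast point lookups and updates.
  CREATE TABLE positions_current (
    account_id BIGINT,
    symbol VARCHAR(16),
    quantity DOUBLE,
    PRIMARY KEY (account_id, symbol)
  );

  -- On-disk columnstore: compression and fast scans over large history.
  CREATE TABLE positions_history (
    account_id BIGINT,
    symbol VARCHAR(16),
    quantity DOUBLE,
    ts DATETIME,
    KEY (ts) USING CLUSTERED COLUMNSTORE
  );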

Finally, MemSQL has a native data ingestion feature, called Pipelines, that allows you to easily and very quickly ingest data from a variety of data sources (such as Kafka, AWS S3, Azure Blob, and HDFS). All these capabilities offered in a single integrated system add up to making it the best data infrastructure for an operational analytics workload, bar none.

Describing the workload in general terms is a bit abstract, so let’s dig into some of the specific use cases where operational analytics is the most useful.

Portfolio Analytics

One of the most common use cases we see in financial services is portfolio analytics. Multiple MemSQL customers have written financial portfolio management and analysis systems that are designed to provide premium services to elite users.

These elite users can be private banking customers with high net worth or fund managers who control a large number of assets. They will have large portfolios with hundreds or thousands of positions. They want to be able to analyze their portfolio in real-time, with graphical displays that are refreshed instantaneously as they filter, sort, or change views in the application. The superb performance of MemSQL allows sub-second refresh of the entire screen with real-time data, including multiple tables and charts, even for large portfolios.

These systems also need to scale to hundreds or thousands of users concurrently hitting the system, especially when the market is volatile. Lastly, they need to bring in the freshest market data, without compromising the ability to deliver the strict latency SLAs for their query response times.

They need to do all of this securely, without violating relevant compliance requirements or the trust of their users. High availability and reliability are key requirements, because the market won’t wait. MemSQL is an ideal data infrastructure for this operational analytics use case, as it solves the key requirements of fast data ingest, high-scale concurrent user access, and fast query response.

Predictive Maintenance

Another common use case we see is predictive maintenance. Customers who have services or devices that are running continuously want to know as quickly as possible if there is a problem.

This is a common scenario for media companies that do streaming video. They want to know if there is a problem with the quality of the streaming so they can fix it, ideally before the user notices the degradation.

This use case also comes up in the energy industry. Energy companies have devices (such as oil drills, wind turbines, etc.) in remote locations. Tracking the health of those devices and making adjustments can extend their lifetime and save millions of dollars in labor and equipment to replace them.

The key requirements are the ability to stream the data about the device or service, analyze the data – often using a form of ML that leverages complex queries – and then send an alert if the results show any issues that need to be addressed. The data infrastructure needs to be online 24/7 to ensure there is no delay in identifying these issues.

Personalization

A third use case is personalization. Personalization is about customizing the experience for a customer. This use case pops up in a number of different verticals, such as a user visiting a retail web site, playing a game in an online arcade, or even visiting a brick and mortar store.

The ability to see a user’s activity and, more importantly, learn what is attractive to them, gives you the information to meet their needs more effectively and efficiently. One of MemSQL’s customers is a gaming company. They stream information about the user’s activity in the games, process the results against a model in MemSQL, and use the results to offer the user discounts for new games and other in-app purchases.

Another example is a popular music delivery service that uses MemSQL to analyze usage of the service to optimize ad spend. The size of data and the number of employees using the system made it challenging to deliver the data in a timely way to the organization and allow them to query the data interactively. MemSQL significantly improved their ability to ingest and process the data and allowed their users to get a dramatic speedup in their query response times.

Summary

Operational analytics is a new workload that encompasses the operational requirements of an OLTP workload – data lookups, transactionality, availability, reliability, and scalability – as well as the analytical requirements of an OLAP workload – large data sets and fast queries.

Coupled with the new requirements of high user concurrency and fast ingestion, the operational analytics workload is tough to support with a legacy database architecture or by cobbling together a series of disparate tools. As businesses continue along their digital transformation journey they are finding more and more of their workloads fit this pattern and are searching for new modern data infrastructure, like MemSQL, that has the performance and scale capabilities to handle them.

Graphs in Government: Introduction to Graph Technology

$
0
0

Feed: Neo4j Graph Database Platform.
Author: Jocelyn Hoppa.

The use cases for a graph database in government are endless.

Graphs are versatile and dynamic. They are the key to solving the challenges you face in fulfilling your mission.

Using real-world government use cases, this blog series explains how graphs solve a broad range of complex problems that can’t be solved in any other way.

Discover how graph technology is being used in government.

In this series, we will show how storing data in a graph offers benefits at scale, for everything from the massive graph used by the U.S. Army for managing strategic assets to recalling NASA’s lessons learned over the past 50 years.

Graphs Are Everywhere

Everywhere you look, you’ll find problems whose solutions involve connecting data and traversing data relationships, often across different applications or repositories, to answer questions that span processes and departments.

Uncovering the relationships between data locked in various repositories requires a graph database platform that’s flexible, scalable and powerful. A graph database platform reveals data connectedness to achieve your agency’s mission-critical objectives – and so much more.

The Power of a Graph Database Platform

To understand the power of a graph database, first consider its collection-oriented predecessor, a traditional relational database.

Relational databases are good for well-understood, often aggregated, data structures that don’t change frequently – known problems involving minimally connected or discrete data. Increasingly, however, government agencies and organizations are faced with problems where the data topology is dynamic and difficult to predict, and relationships among the data contribute meaning, context and value. These connection-oriented scenarios necessitate a graph database.

A graph database enables you to discover connections among data, and do so much faster than joining tables within a traditional relational database or even using another NoSQL database such as MongoDB or Elasticsearch.

Neo4j is a highly scalable, native graph database that stores and manages data relationships as first-class entities. This means the database maintains knowledge of the relationships, as opposed to a relational database (RDBMS), which instantiates relationships using table JOINs based on a shared key or index.

A native graph database like Neo4j offers index-free adjacency: data is inherently connected with no foreign keys required. The relationships are stored right with the data object, and connected nodes physically point to each other.

Discover the difference between a relational database and a graph database.

Conclusion

As we’ve shown, a graph database enables you to discover connections among data, and do so much faster than joining tables within a traditional relational database.

Graph databases are as versatile as the government agencies that use them. In the coming weeks, we’ll continue showing the innovative ways government agencies are using graph databases to fulfill their missions.



About the Author

Jason Zagalsky , Federal Account Manager, Neo4j



Jason has 20 years of technical sales and engineering design experience. He does full technology stack software sales that includes database, middleware, identity management, content management, business intelligence and engineered systems. He has in-depth knowledge of high-performance computing systems, storage systems and advanced visualization, as well as complex real-time embedded computer systems – from system level architectures to low-level programming of FPGA-based processing hardware and algorithm implementation. Jason is also a subject matter expert in secure information sharing/cross domain solutions with deep content inspection and sanitization.


Graphs in Government: The Power of Graph Technology

$
0
0

Feed: Neo4j Graph Database Platform.
Author: Jocelyn Hoppa.

The use cases for a graph database in government are endless.

Graphs are versatile and dynamic. They are the key to solving the challenges you face in fulfilling your mission.

Using real-world government use cases, this blog series explains how graphs solve a broad range of complex problems that can’t be solved in any other way.

Last week we gave an overview of this series, such as how storing data in a graph offers benefits at scale, for everything from the massive graph used by the U.S. Army for managing strategic assets to recalling NASA’s lessons learned over the past 50 years.

Learn more about how graph technology is used in government agencies.

This week we will show how Neo4j enables government agencies and organizations to perform deep and complex queries, reduce infrastructure costs, maximize value from existing resources, deliver immediate answers at scale and meet security demands.

Perform Deep, Complex Queries

Governments today are challenged with solving complex problems. With the vast amount of data they have pouring in, the answers exist somewhere – but only if you can make sense of the growing volume, variety and interrelationships of data in disparate sources.

Data becomes more useful once its connectedness is established. Connected data is the representation, usage and persistence of relationships between data elements. Neo4j makes it possible to query relationships across disparate data sources, regardless of the type of data or originating database.

Neo4j connects multiple layers of data – across processes, people, networks and things. Once you’ve connected layers, you gain intelligence downstream and provide a connected view of the data to analytic and operational applications. You also obtain context that allows you to more deeply or better refine the pieces of information you’re collecting. The better your understanding of data connections, the better your downstream insights will be.

Neo4j empowers government agencies and organizations to iterate and expand on current datasets, gaining momentum to execute on bigger and better ideas, and find deeper contextual meaning in the data.

Using graph technology, you can increase the number of hops (the levels of connections) between data without a corresponding increase in compute cost. As a result, you gain higher degrees of context not easily achieved by JOINing three or four tables together in an RDBMS.

Neo4j’s architecture enables these deep, complex queries. The enterprise-grade, native graph database is built from the ground up to traverse data connections at depth, in real time and at scale.

Reduce Infrastructure Costs

Your government agency runs on a lean budget. Any opportunity to reduce infrastructure spending frees up resources to focus on the core mission. A graph database does just that.

It delivers deep, complex queries with less hardware, which means reduced costs. The standard, highly available Neo4j installation is 3-5 servers, versus an RDBMS with a graph layer, which requires about 50 servers for the same scale. With this efficiency, Neo4j also requires fewer licenses, further reducing database costs. Neo4j offers deployment flexibility, with servers on-premises or in the cloud.

Maximize Value from Existing Resources

A rip-and-replace approach is a non-starter for most government technology projects. By connecting data across diverse existing data stores, Neo4j leverages the value of all your existing systems. And when it’s time to replace aging applications, government contractors and agencies find that Neo4j is a cost-effective agile foundation for new initiatives.

Deliver Immediate Answers at Scale

Government agencies and organizations must store massive amounts of data and need answers fast.

Neo4j delivers a 1,000x performance advantage over relational and other NoSQL databases hosting graph engines, reducing response times from minutes to milliseconds for queries of graphs containing billions of connections.

Neo4j traverses any level of data in real-time due to its native graph architecture. RDBMS and other NoSQL databases typically see a significant performance degradation when traversing data beyond three levels of depth.

Meet Security Demands

Neo4j fulfills the stringent security demands of government customers. In addition to meeting Federal Information Security Modernization Act (FISMA) requirements, Neo4j’s advanced security architecture supports attribute-based access control (ABAC) as well as role-based access control (RBAC).

Neo4j is approved to run in a classified environment by many Department of Defense and Intelligence Community agencies. Authority to Operate (ATO) has been granted for several applications that are built on Neo4j running on classified networks. Many civilian agencies have Neo4j approved to run on their networks as well.

The Value of Connected Data in Criminal Investigations

Criminal investigations highlight the value of connected data, because connections in data point to potential suspects in a case. A suspect often appears in several different databases. Connecting that data is key for investigators to find out all they can about a suspect through phone records, financial transactions, fingerprints, DNA, court records, associates and more.

Separate data silos of people, objects, locations and events (POLE) aren’t useful if you’re doing a criminal investigation or trying to stop a terrorist attack. You need the relationships that span those data silos and contextualize the activities and associations among suspects.

The idea hinges on who knows who. If Person X has come to the attention of the authorities for whatever reason, who else in Person X’s network might be of interest? This complexity is hard to capture and explore through conventional database technologies like RDBMS. Graph database platforms excel at analyzing connected data.

Conclusion

As we’ve shown, Neo4j enables government agencies and organizations to do many functions, such as perform deep complex queries and reduce infrastructure costs.

Graph databases are as versatile as the government agencies that use them. In the coming weeks, we’ll continue showing the innovative ways government agencies are using graph databases to fulfill their missions.



About the Author

Jason Zagalsky , Federal Account Manager, Neo4j



Jason has 20 years of technical sales and engineering design experience. He does full technology stack software sales that includes database, middleware, identity management, content management, business intelligence and engineered systems. He has in-depth knowledge of high-performance computing systems, storage systems and advanced visualization, as well as complex real-time embedded computer systems – from system level architectures to low-level programming of FPGA-based processing hardware and algorithm implementation. Jason is also a subject matter expert in secure information sharing/cross domain solutions with deep content inspection and sanitization.


Hazelcast Responds to Redis Labs’ Benchmark

$
0
0

Feed: Blog – Hazelcast.
Author: Greg Luck.


Due to its underlying architecture and many years of optimization, Hazelcast is extremely fast and dramatically outperforms Redis Labs (and Redis open source), especially at scale.

Last year, Redis Labs published a very misleading benchmark against Hazelcast. We have closely investigated Redis Labs’ test and discovered many misleading aspects of the benchmark. As a result, we have reproduced the test with these issues corrected.

To be clear, Redis Labs is not the same thing as Redis, which is a popular open source project driven singularly by Salvatore Sanfilippo. By contrast, Redis Labs is a company that produces a proprietary, closed source fork of Redis which is client and API compatible, called Redis Enterprise.

This was an action initiated by the company Redis Labs, something with which Salvatore and Redis – the open source project – had no involvement.

The Facts

It is difficult to avoid the conclusion that Redis Labs knew it was manipulating the benchmark toward its desired outcome. Here are the facts:

  • ZERO TRANSPARENCY: Firstly, the benchmark configuration was closed source. There are some snippets in the blog post, but not the full configuration used for the test. Hazelcast publishes its source code and configuration for competitors and customers to reproduce the benchmark.
  • INTENTIONALLY DIFFERENT TOOLS: Redis Labs’ benchmark compares the performance of memtier_benchmark driving C++ clients against Redis Enterprise versus RadarGun driving Java clients against Hazelcast open source. This is invalid. You must use the same benchmarking tool and the same programming language; otherwise, you’re comparing apples and oranges, which is the case with the Redis Labs benchmark.
  • INTENTIONALLY OVERLOADING GARBAGE COLLECTION: The dataset size was 42GB, run on three Hazelcast nodes. With Hazelcast defaults, this would typically mean 84GB of data. Heap size per JVM would, therefore, have to be at least 30GB (28GB of storage and 2GB of working space). Hazelcast recommends the use of Hazelcast Enterprise HD instead of running large heap sizes, due to the limitations of garbage collection (GC), when running on a smaller number of nodes. GC alone would be enough to make Hazelcast run slowly. Note that as an alternative and to alleviate GC issues, Hazelcast Open Source can run on more nodes each with smaller heap configurations. Redis Labs had options on how to configure the data within Hazelcast; they chose not to use them.
  • SKEWING OF DATA SIZES: The sample data sizes were different between the two benchmarks. With memtier_benchmark Redis Labs used a random distribution of value lengths between 100 bytes and 10KB. RadarGun uses fixed size values and is not able to generate a random distribution. Once again, Redis Labs does not provide access to the configuration source, so Hazelcast does not know what size was used with RadarGun.
  • INTENTIONALLY LOW THREAD COUNT: The benchmark uses a single thread per client. We found this very strange as it does not reflect production reality. In production, Hazelcast and presumably Redis Labs have clients connecting to it running many threads.
  • COMPARING PIPELINING WITH SINGLE CALLS: Redis has pipelining, where multiple commands are sent and executed on the server. Hazelcast has async methods, which operate similarly. Both approaches result in higher throughput, but Redis Labs used pipelining for Redis Enterprise while not using async for Hazelcast.

Redis Labs Refused To Grant a License Key

To reproduce the benchmark we wanted to test the proprietary Redis Labs Enterprise version, seeing as they chose to use that version against Hazelcast (as opposed to open source Redis). Not surprisingly, Redis Labs kept things in the dark.

For the flawed benchmark, Redis Labs used Redis Enterprise with a license for 48 shards. Since its downloadable trial license is limited to 4 shards, we reached out to them with the proper license request.

We contacted Keren Ouaknine, the Redis Labs performance engineer who published the benchmark, and requested both the configuration source and a trial license of Redis Enterprise to let us reproduce the test. Neither request was granted. We then asked the same from Yiftach Shoolman, CTO of Redis Labs. No reply was received.

Redis Labs Versus Redis Open Source: No Faster

Stuck with only the 4-shard trial license, we decided to examine the performance of Redis Open Source vs. Redis Labs Enterprise running with 4 shards.

We benchmarked Redis Labs Enterprise 5.2.0 and Redis Open Source 4.0.11 and found no performance difference.

Re-running Redis Labs’ Benchmark

We re-ran their benchmark after correcting for its “errors” and have published the results.

The things we corrected were:

  • We used Java clients for both. The original benchmark used C++ with Redis. We added Redis with the Lettuce client to RadarGun. Java is by far the most popular client for Hazelcast.
  • We used the same benchmarking tool with the same test configuration for each. The original benchmark uses memtier_benchmark. We added Redis support to RadarGun, the tool they used to test Hazelcast.
  • We used Hazelcast Enterprise HD 3.12, our latest version. The original benchmark used open source Hazelcast 3.9 with a 40GB heap which caused serious garbage collection problems.
  • We used the same payload size distribution for each. The original benchmark used fixed 10KB values with Hazelcast due to the limitations of RadarGun and 5KB average payload sizes for Redis; thus we added variable payload size support to RadarGun.
  • We used pipelining for both. The original benchmark used pipelining for Redis but not Hazelcast. In IMDG 3.12, Hazelcast now supports pipelining. We added it to RadarGun.
  • We used async backup replication for both. The original benchmark used async for Redis and sync for Hazelcast.
  • We used Redis Open Source 5.0.3 because we could not get a license for Redis Labs Enterprise. We tested both and saw no performance difference at the 4 shard level, so consider Redis Open Source to be a good proxy.

Again, note that this benchmark is at very low thread count, which is not typical of operational deployments. Regardless, with the above corrections we found that, in our benchmark, we were slightly faster than Redis Labs Enterprise.

Put differently, Redis Labs carefully manipulated this benchmark to show off the benefits of pipelining. We agree pipelining is good, and when the benchmark configures it in both products the results are clear.

There are lots of ways of doing benchmarks. Ultimately, we recommend users take their workload and benchmark against it to honestly figure out what solution works.

However, since this topic is now open, there are two very important additional areas that come up a lot around performance:

  1. How does the system perform at scale?
  2. IMDGs all have near caches. NoSQL, including Redis, does not. How does that affect performance?

Hazelcast Outperforms Redis at Scale

A simple out-of-the-box test that connects and performs “n” operations in a loop, also known as a single-threaded test, shows that Redis is faster (see the blue lines in the two charts below). However, if you scale those threads up, Hazelcast gets continually faster and extends its lead over Redis. At 128 threads, Hazelcast has almost double the throughput of Redis (purple lines below).

Hazelcast Scaling Behavior

Redis Scaling Behavior

Plotted differently, the view is very clear: Hazelcast demonstrates near-linear scaling while Redis hits its limit at only 32 threads. In our experience with the world’s largest customers, Redis’ thread limitation is extremely problematic and either fails to meet the proof-of-concept requirements or fails to accommodate load growth over time.

Relationship of Scale to Threads

Hazelcast scalability is easy to explain: it’s rooted in the fact that Hazelcast is multi-threaded in both the client and the server, utilizing an approach known as Staged Event-Driven Architecture (SEDA). This allows high scale and efficiency with large numbers of threads. On the other hand, Redis clients, including Redis open source and Redis Labs Enterprise, are single-threaded and use blocking IO. This is a very simple approach that is fast for a single thread, but not for multi-threaded applications, and large numbers of threads are exactly what production environments present.

As an aside, Redis Labs claims that Redis Labs Enterprise is multi-threaded. In computer science terms this is not true; they run a per-server load balancer that farms requests out to Redis processes on that server. Each Redis Labs Enterprise process is single-threaded, just like Redis open source. Compared to Hazelcast multi-threading, this again explains the scalability difference.

Hazelcast has Near Cache: Redis Doesn’t

A near cache is an LRU cache that sits in the client, with an eventually consistent data integrity guarantee. These are great for read-heavy caching workloads. All in-memory data grids have this feature. NoSQL databases do not.

To implement this, you need to have a sophisticated client. Hazelcast’s Java client is multi-threaded and has a sophisticated and highly performant near cache. It can store data off-heap and it can reload the last keys it held on restart.

With a near cache, Hazelcast data access speeds are in the high nanoseconds.

How much benefit you get from a near cache depends on the workload. Many are Pareto distributions so that a near cache of 20% can speed up access times by 80%.

In our first benchmark against Redis several years ago, we demonstrated the effect of adding a near cache. Hazelcast performed 5 million get requests per second versus Redis at 750,000.

Conclusion

Hazelcast is by far the fastest in-memory solution in the market, particularly when requirements get complicated. This capability is the result of a sustained, multi-year effort, working with some of the world’s most demanding customers. It’s vital that businesses make decisions based on accurate and transparent information, which is why we took Redis Labs’ benchmark, fixed its inherent flaws, and proved we were just as fast at low thread counts. We then proceeded to show how we perform far better at scale, and the effect near cache can have on performance.

To ensure customers and users have the most accurate information, we do our best to publish our test configurations and code on GitHub, so that customers and users can take these tools, reproduce the results, and adapt them to their scenarios.

It is difficult to conclude anything other than that Redis Labs deliberately created a misleading and unrepresentative benchmark, thinking this would satisfy its enterprise customers and prospects. Redis Labs continually peddles the myth in the marketplace that it is the fastest in-memory store; this blog proves unambiguously that this is not true.

Redis Labs markets itself as a better Redis open source. From a performance perspective, we find they are basically equal. Anyone is free to replicate that test; the details are in the body of the benchmark.

We invite Redis Labs to enter into a dialogue with Hazelcast on a fair benchmark, including giving Hazelcast a license key for 48 shards, as previously requested.

Why You Should Attend DataStax Accelerate, from One Developer to Another

$
0
0

Feed: Blog Post – Corporate – DataStax.
Author: Matthias Broecheler, Chief Technologist.

DataStax Accelerate, the world’s premier Apache Cassandra™ conference, is right around the corner, taking place May 21–23 in Washington, D.C.

I couldn’t be more excited.

If you’re a developer, I know what you’re thinking. Why should I attend a conference when all of the talks are going to be available online once the event ends?

On LinkedIn, I recently made the case for why people like us should attend Accelerate in the flesh. In summary:

  • You’ll have lots of opportunities to discuss the precise problems you’re experiencing with dedicated experts who aren’t trying to sell you anything and are simply excited to solve real-world problems.
  • You’ll be able to approach the speakers in person. Yes, their talks will be online after the conference ends. But you can’t ask those key follow-up questions and get your own personal context for the content unless you are in the room.
  • You’ll be able to attend a bootcamp to learn the basics of Cassandra if you’re new to it, and that is not something re-created or available later.
  • You’ll meet like-minded developers you can share database battle stories with.
  • Developers and administrators can take the Apache Cassandra certification exam for free when they attend the conference. Join thousands of engineers already certified on the industry’s most popular, massively scalable NoSQL database.
  • You may even meet someone who introduces you to your next job.

The agenda is set, and no matter your skill level, role, or priorities, there’s something for everyone at DataStax Accelerate, with sessions organized across eight tracks.

Here are some of the sessions I’m looking forward to in particular.

Sergiy Smyrnov, lead database architect at Walgreens, will explain why the leading pharmacy retailer chose to build its Rx Microservices Application Stack on top of DataStax. He’ll also explain how Walgreens is using DataStax and Apache Cassandra in the Microsoft Azure Cloud, including a discussion about the implementation of DataStax Enterprise security features, such as LDAP, SSL/TLS, TDE, and audit capabilities.

Pascal Desmarets, CEO of Hackolade, will make the case that while many organizations believe that data modeling is a bottleneck that doesn’t fit in with an agile development approach, that’s a distorted perception. Learn why Desmarets believes data modeling needs to be reinvented for the agile age and how his company is using Cassandra and DataStax Enterprise to make it all happen.

Andrew Prudhomme and Abrar Sheikh of Yelp will explain Data Pipeline, the company’s robust stream processing ecosystem that includes a Cassandra Source Connector that streams data updates made to Cassandra to Kafka in real time. The duo will also discuss how Yelp uses Cassandra CDC and Apache Flink to produce a Kafka stream that contains the full content of each modified row, as well as its previous value.

Mike Treadway, principal cloud architect at IBM, will discuss how to get a Cassandra cluster up and running in Kubernetes—and one that is globally distributed and operationally viable. This session will focus on the technical and operational issues that IBM encountered in this type of environment, as well as the solutions they’ve implemented to solve those problems.

Troy Motte, lead software developer at Siemens, was recently tasked with moving billions of records from MySQL to Apache Cassandra and changing all of the legacy logic. This talk will focus on how he was able to complete the migration gradually and without any major headaches.

That’s just a little taste of what you can expect at the conference. And there’s a whole lot more. Check out the full schedule of sessions here.

Better yet, go ahead and register now! I look forward to seeing you there.

DataStax Accelerate

REGISTER NOW




Why Anyone into Apache Cassandra Needs to Attend Accelerate

$
0
0

Feed: Blog Post – Corporate – DataStax.
Author: Patrick McFadin, VP of Developer Relations.

If you are building modern applications that scale in all sorts of ways, then you’re probably into Apache Cassandra. If you’re attracted to multi-data center replication, fault tolerance, tunable consistency, and the distribution of data across clusters in a masterless architecture, then you are definitely into Cassandra. As a Cassandra enthusiast, you probably need to attend DataStax Accelerate, the world’s premier Cassandra conference, which is taking place May 21–23 near Washington, D.C. This is a great opportunity to spend a couple of focused days learning from others and meeting people just like yourself.

Cassandra has come a long way since it was open sourced more than 10 years ago. I myself have a long history with Cassandra, some of it painful in a “nobody gets it” kind of way. But we all know where the story is headed!

We have collectively pushed through the fear, uncertainty, and doubt to arrive at an interesting point in the history of NoSQL databases and databases in general. The real value behind the transformational technology in Cassandra is leading the migration, intentional or not, to hybrid and multi-cloud computing environments.

Hybrid cloud popularity is building as organizations become more aware of the growing technical debt around data usage and their databases. What’s your cloud strategy? How has that radically changed in the past few years? As of today, it’s hardly ever a bulk migration to one cloud. It’s a mix of old and new, which has put a lot of burden on engineers struggling to make it work.

As an open source project built with scaling in mind, Cassandra is uniquely positioned to be the go-to database of the hybrid cloud future, and DataStax expertise will play a key role in allowing companies to take full advantage of it. We get what you are trying to do and we are here to help.

As a part of that commitment, we are hosting a gathering place for the community of engineers trying to build that future. At Accelerate, we’re looking forward to bringing together all kinds of Cassandra enthusiasts to meet one another, share ideas, and learn how some of today’s leading enterprises are using Cassandra to change the world.

You do not want to miss this event. Just take a quick peek at some Accelerate sessions to give you an idea of what to expect:

Learn how Yahoo! Japan uses Cassandra at scale, spinning up as many as 5,000 servers at any time. Yahoo! Japan has been a driving force for Cassandra in Japan and hosts many awesome community events. They bring a deep expertise you won’t want to miss.

Apache Cassandra™ 4.0 is almost here! This session explores the new features and performance improvements you can expect, as well as the thinking that inspired the architecture of the release. There is no better source of 4.0 information than Cassandra committer  Dinesh Joshi.

Are you getting the most out of your Cassandra cluster? Probably not. Discover how to pinpoint bottlenecks and learn 10 simple ways to boost performance while cutting costs. This is a talk by Jon Haddad, who has been a leader in Cassandra performance for years. You may see him on the user mailing list talking about this topic often. Here is your chance to meet him in person and get some firsthand knowledge.

Instagram has been using Cassandra for years, and their deployment is still growing fast. Learn how the company has worked to improve its infrastructure over the years, increasing the efficiency and reliability of their clusters along the way.

Modern companies are connecting Cassandra and Kafka to stream data from microservices, databases, IoT events, and other critical systems to answer real-time questions and make better decisions. Find out why Cassandra and Kafka should be core components of any modern data architecture.

Everything is an event. To speed up applications, you need to think reactive. The DataStax team has been developing Cassandra drivers for years, and in our latest version of the enterprise driver, we introduced reactive programming. Learn how to migrate a CRUD Java service into reactive and bring home a working project.

See how FamilySearch uses DataStax Enterprise to solve large-scale family history problems, handling three unique workloads in production.

And that’s just the tip of the iceberg. Check out other Cassandra sessions you won’t want to miss here.

See you in D.C.!

DataStax Accelerate

REGISTER NOW



Our Technology Alliance welcomes Azure Cosmos DB

$
0
0

Feed: Cambridge Intelligence.
Author: Catherine Kearns.

The Cambridge Intelligence Technology Alliance features organizations who want to help us solve the most complex graph data visualization challenges faced by our customers. We’re pleased to announce that the latest member to join is Azure Cosmos DB, Microsoft’s globally distributed, multi-model database service.


Azure Cosmos DB is purposely designed for building high performance, planet-scale applications. It provides a highly-available, massively scalable and secure database platform, with turnkey global distribution available in more regions than any other cloud provider. There’s native support for NoSQL, so developers can choose whichever popular API they prefer to work with. The graph database, based on the Apache TinkerPop Gremlin standard, is one of them.

Seamless integration between Cosmos DB and our expert toolkit technology brings advanced graph visualization to Azure enterprises. It’ll change the way users view and work with their complex connected data in everything from retail and supply chain to fraud detection and IoT.

Corey Lanum, Cambridge Intelligence’s Commercial Director, said:

The volume of data stored by organizations is ever increasing. No matter where they are in the world, users need that data to be available in milliseconds, even during peak times. Azure Cosmos DB provides the enterprise-grade performance and security that our toolkit users need. We know that users will find the insight they’re after when they visualize their Cosmos DB data.

Luis Bosquez, Azure Cosmos DB Program Manager, said:

A key part of every database solution is the ability to visualize it. A lot of Microsoft Azure Cosmos DB customers want to see the connections in their data, and we’re happy that Cambridge Intelligence’s technology provides a solution to make that happen. It’s an ideal collaboration that will benefit joint users around the globe – our reliable, scalable database with Cambridge Intelligence’s data visualization expertise.

About the Technology Alliance

Our Technology Alliance is formed of industry-leading organizations whose products and services complement our own offering. Existing members include Microsoft Services, Neo4j and DataStax. They share our mission to help people understand complex connected data, and offer high quality tools that can be seamlessly integrated with our own.

About Azure Cosmos DB

Azure Cosmos DB is Microsoft’s globally distributed, multi-model database service. With a click of a button, Cosmos DB enables you to elastically and independently scale throughput and storage across any number of Azure regions worldwide. You can take advantage of fast, single-digit-millisecond data access using your favorite API including SQL, MongoDB, Cassandra, Tables, or Gremlin. You can create a free Cosmos DB account to get started.

Get high-performance scaling for your Azure database workloads with Hyperscale

Feed: Microsoft Azure Blog.
Author: Rohan Kumar.

In today’s data-driven world, driving digital transformation increasingly depends on our ability to manage massive amounts of data and harness its potential. Developers who are building intelligent and immersive applications should not have to be constrained by resource limitations that ultimately impact their customers’ experience.

Unfortunately, resource limits are an inescapable reality for application developers. Almost every developer can recall a time when database compute, storage, and memory limitations impacted an application’s performance. The consequences are real: the time and cost spent compensating for platform limitations, degraded usability from higher latency, and even downtime associated with large data operations.

We have already broken limits on NoSQL with Azure Cosmos DB, a globally distributed multi-model database with multi-master replication. We have also delivered blazing performance at incredible value with Azure SQL Data Warehouse. Today, we are excited to deliver a high-performance scaling capability for applications using the relational model, Hyperscale, which further removes limits for application developers.

Hyperscale Explained

Hyperscale is a new cloud-native solution purpose-built to address common cloud scalability limits in compute, storage, memory, or combinations of all three. Best of all, you can harness Hyperscale without rearchitecting your application. The technology implementation of Hyperscale is optimized for different scenarios and customized for each database engine.

Announcing:

Azure Database for PostgreSQL Hyperscale

Hyperscale (powered by Citus Data technology) brings high-performance scaling to PostgreSQL workloads by horizontally scaling a single database across hundreds of nodes. This allows more data to fit in memory, queries to be parallelized across hundreds of nodes, and data to be indexed faster, so developers can serve workloads that ingest and query data in real time, with sub-second response times, at any scale, even with billions of rows. The addition of Hyperscale as a deployment option for Azure Database for PostgreSQL simplifies infrastructure and application design, freeing time to focus on business needs. Hyperscale stays compatible with the latest PostgreSQL innovations, versions and tools, so you can leverage your existing PostgreSQL expertise.

Also, the Citus extension is available as an open source download on GitHub. We are committed to partnering with the PostgreSQL community on staying current with the latest releases so developers can stay productive.
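
As a hedged illustration of the scale-out model, the sketch below uses psycopg2 to distribute a table across worker nodes with the Citus create_distributed_table function, assuming a Hyperscale (Citus) server group already exists. The connection string, table and column names are placeholders, not details from this announcement.

```python
# Minimal sketch: sharding a table across a Hyperscale (Citus) server group.
# Connection details, table and column names are placeholders.
import psycopg2

conn = psycopg2.connect(
    "host=YOUR-SERVER-GROUP.postgres.database.azure.com "
    "dbname=citus user=citus password=YOUR-PASSWORD sslmode=require"
)
conn.autocommit = True

with conn.cursor() as cur:
    # An events table keyed by tenant, a common multi-tenant SaaS shape.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            tenant_id  bigint NOT NULL,
            event_id   bigserial,
            payload    jsonb,
            created_at timestamptz DEFAULT now(),
            PRIMARY KEY (tenant_id, event_id)
        );
    """)
    # Citus shards the table on tenant_id and spreads it across worker nodes.
    cur.execute("SELECT create_distributed_table('events', 'tenant_id');")

    # Subsequent queries are routed and parallelized transparently.
    cur.execute("SELECT count(*) FROM events WHERE tenant_id = %s;", (42,))
    print(cur.fetchone()[0])

conn.close()
```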

Use Azure Database for PostgreSQL Hyperscale for low latency, high-throughput scenarios like:

  • Developing real-time operational analytics
  • Enabling multi-tenant SaaS applications
  • Building transactional applications

Learn more about Hyperscale on Azure Database for PostgreSQL.

Azure SQL Database Hyperscale

Azure SQL Database Hyperscale is powered by a highly scalable storage architecture that enables a database to grow as needed, effectively eliminating the need to pre-provision storage resources. Compute and storage resources scale independently, giving you the flexibility to optimize performance for each workload. The time required to restore a database or to scale up or down is no longer tied to the volume of data in the database, and database backups are virtually instantaneous. For read-intensive workloads, Hyperscale provides rapid scale-out by provisioning additional read replicas as needed to offload read traffic.

Azure SQL Database Hyperscale joins the General Purpose and Business Critical service tiers, which are configured to serve a spectrum of workloads. 

  • General Purpose – offers balanced compute and storage, and is ideal for most business workloads with up to 8 TB of storage.
  • Business Critical – optimized for data applications with fast IO and high availability requirements with up to 4 TB of storage.

Azure SQL Database Hyperscale is optimized for OLTP and high-throughput analytics workloads with storage of up to 100 TB. It satisfies highly scalable storage and read-scale requirements and supports migrating large on-premises workloads and data marts running on symmetric multiprocessor (SMP) databases. Azure SQL Database Hyperscale significantly expands the potential for application growth without being limited by storage size.

Learn more about Azure SQL Database Hyperscale.
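
For a sense of what the tier change might look like in practice, here is a hedged sketch that issues an ALTER DATABASE statement from Python via pyodbc to request a Hyperscale service objective. The server, credentials, database name and the 'HS_Gen5_2' objective are placeholder assumptions, so verify the current objective names (and note that, at the time of writing, moving to Hyperscale was a one-way operation) before trying anything like this.

```python
# Sketch: moving an existing Azure SQL database to the Hyperscale tier.
# Server, credentials, database name and service objective are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server.database.windows.net;"
    "DATABASE=master;"   # tier changes are typically issued against master
    "UID=your-admin;PWD=your-password",
    autocommit=True,     # ALTER DATABASE cannot run inside a transaction
)
cursor = conn.cursor()

# Request the Hyperscale service objective (example: Gen5, 2 vCores).
cursor.execute("ALTER DATABASE [your-db] MODIFY (SERVICE_OBJECTIVE = 'HS_Gen5_2');")

# The change is asynchronous; check the current objective to see when it flips.
cursor.execute("SELECT DATABASEPROPERTYEX('your-db', 'ServiceObjective');")
print(cursor.fetchone()[0])

conn.close()
```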

Azure SQL Database Hyperscale is not the only SQL innovation we are announcing today! Azure SQL Database is also introducing a new serverless compute option: Azure SQL Database serverless. This new option allows compute and memory to scale independently based on workload requirements. Compute is automatically paused and resumed, eliminating the need to manage capacity and reducing cost. Azure SQL Database serverless is a fantastic option for applications with unpredictable or intermittent compute requirements.

Learn more about Azure SQL Database serverless.

Build applications in a familiar environment with tools you know

Azure relational databases share more than Hyperscale. They are built upon the same platform, with innovations like intelligence and security shared across the databases so you can be most productive in the engine of your choice.

Trained on millions of databases over the years, these intelligent features:

  • Inspect databases to understand the workloads
  • Identify bottlenecks
  • Automatically recommend options to optimize application performance 

Intelligence also extends to security features like:

  • Advanced threat protection that continuously monitors for suspicious activities
  • Providing immediate security alerts on potential vulnerabilities
  • Recommending actions on how to investigate and mitigate threats

Because we do not rely upon forked versions of our engines, you can confidently develop in a familiar environment with the tools you are used to – and rest assured that your hyperscaled database is always compatible and in-sync with the latest SQL and PostgreSQL versions.

Ready to break the limits?

Hyperscale enables you to develop highly scalable, analytical applications, and low latency experiences using your existing skills on both Azure SQL Database and Azure Database for PostgreSQL. With Hyperscale on Azure databases, your applications will be able to go beyond the traditional limits of the database and unleash high performance scaling. We can’t wait to see what you will create with us.

Summary – Mydbops Database Meetup (Apr-2019)

Feed: Planet MySQL
Author: MyDBOPS

Conglomeration, Collaboration and Celebration of Database Administrators

The founders of Mydbops envisioned contributing knowledge back to the community. That vision took shape in the 3rd edition of the Mydbops Database Meetup, held on Saturday, 27 April 2019, which drew a good number of members of the open source database administration community to the venue. The core agenda was set on “High Availability concepts in ProxySQL and MaxScale”, with additional presentations on MongoDB internals and on MySQL Orchestrator and its implementation at Ola (ANI Technologies Pvt. Ltd.).

Participants from organisations such as MariaDB, TeleDNA, CTS, OLA, Infosys and Quikr gathered for the meetup; the breakdown of participants by organisation (in percentages) is shown in the accompanying chart.

The introduction opened with the vision and mission of contributing to the open source database community, and revisited the journey to this 3rd edition of the Mydbops Database Meetup with gratitude to all its contributors. The three-way collaboration of participants, organisers and presenters was on full display.

Keynote by Benedict Henry & Manosh Malai

The first session covered the core topic of ProxySQL high availability. Mr. Aakash M and Mr. Vignesh Prabhu, both MySQL 5.7 certified, walked through reverse proxying, handling replication breakage, ProxySQL basics and topology, and ProxySQL clustering, dealing with the concepts and their implementation in detail.

Presentation by S.Vignesh & M. Akash – Mydbops

Mr. Pramod R Mahto, a Technical Support Engineer at MariaDB Corporation with more than a decade of experience, gave an insightful and detailed talk on MariaDB MaxScale as a high-availability solution. His talk focused on router filters, auto-failover, auto-rejoin, switchover and other relevant MaxScale modules. The detail was so engaging that the Q&A session ran well beyond the allotted time, with many questions raised.

Presentation by Pramod R Mahto – MariaDB Corporation

During the high-tea and networking break, several participants appealed for these meetups to be held more frequently, and many highlighted how heavily and helpfully they use the Mydbops blog.

Networking

The session reconvened with an additional topic, “MongoDB WiredTiger Internals: Journey to Transactions”, presented by Mr. Manosh Malai and Mr. Ranjith of Mydbops. It covered horizontal and vertical scaling, the WiredTiger architecture, WiredTiger internals (sessions, cursors, schema), the MVCC workflow and transaction data flush timing. The participants’ enthusiasm for NoSQL and for the presenters was evident from the number of questions raised and clarified during the Q&A session.

Presentation by Manosh Malai & Ranjith – Mydbops

The last session of the day, “Healing of MySQL Topology with MySQL Orchestrator”, was presented by Mr. Krishna Ramanathan and Mr. Anil Yadav from OLA (ANI Technologies Pvt. Ltd.). The discussion covered Orchestrator, the pre-failover process, healing and the post-failover process, with a demo specific to Ola Cabs.

Presentation by Anil Yadav & Krishna Ramanathan – OLA

At the end of the event, the founders of Mydbops, Karthik P R, Vinoth Kanna R S and Kabilesh P R, felicitated the speakers with gifts and beautiful bouquets, and a group photo of all the happy and enlightened participants was taken.

Felicitation of the Speakers
3rd Mydbops Database Meetup Conglomeration

Special thanks to Mr. Manosh Malai, Mr. Selva Venkatesh and Mr. Benedict Henry for organising the event so seamlessly and smoothly, to all the attendees and participants who helped make it a grand success, and to the venue sponsor, Thought Factory (Axis Bank Innovation Lab).

The next event is tentatively scheduled for July 2019. Follow us to know the exact date, time and venue of the 4th Mydbops Database Meetup.

Introduction to the Cosmos DB Emulator

Feed: Databasejournal.com – Feature Database Articles.
Author: .

Cosmos DB constitutes one of the Azure foundational services, which ensures its availability in every existing and newly provisioned Azure region. It delivers a wide range of advantages over traditional SQL and NoSQL-based data stores, including, for example, support for multiple consistency levels, latency and throughput guarantees, policy-based geo-fencing, automatic scaling, and a multi-master replication model. However, it also represents a significant departure from the traditional database management and development approach that most database administrators are familiar with. In this article, we will review several offerings that you can use at no cost to gain hands-on experience with Cosmos DB.

The most straightforward way to administer and interact with Azure Cosmos DB without incurring any monetary charges involves signing up for an Azure Trial subscription, which (at the time this article was written) provides $200 worth of credits during its first month and a 12-month period during which you are entitled to use a subset of services without having to pay for their usage. These free services include Azure Cosmos DB databases of up to 5 GB in size, serving up to 400 request units. As a result, you can implement and test the majority of features available with full-fledged subscriptions, excluding primarily those related to scalability and availability. This approach is geared primarily towards database administrators with no prior experience with Azure Cosmos DB.

Another option, directed towards those interested in Cosmos DB data-related tasks, is the Cosmos DB Query Playground. This interactive website helps you learn the Cosmos DB NoSQL query syntax by providing a set of sample JSON-formatted documents and examples of the most common queries. It also allows you to construct custom queries, guiding you through tasks such as filtering, ordering, and geo-spatial indexing and proximity-based operations.
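
To give a flavour of the query syntax the Playground walks you through, here are a few illustrative Cosmos DB SQL queries, shown as Python string constants to keep all examples in one language. The property names are invented for illustration and do not correspond to the Playground’s actual sample documents.

```python
# Illustrative Cosmos DB SQL queries of the kind the Query Playground covers.
# The container alias "c" is conventional; property names here are invented.

# Filtering with WHERE.
FILTER_QUERY = "SELECT c.id, c.name FROM c WHERE c.category = 'bicycles'"

# Ordering results.
ORDER_QUERY = "SELECT c.id, c.price FROM c ORDER BY c.price DESC"

# Geo-spatial proximity: documents within 5 km of a point. ST_DISTANCE is a
# built-in spatial function; GeoJSON coordinates are [longitude, latitude].
GEO_QUERY = (
    "SELECT c.id FROM c "
    "WHERE ST_DISTANCE(c.location, "
    "{'type': 'Point', 'coordinates': [-122.12, 47.67]}) < 5000"
)
```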

If you are a developer, you will likely be best served by the third option, the Cosmos DB Emulator. It allows you to create a single Cosmos DB account with multiple containers (up to 25 fixed or 5 unlimited) and databases, and to develop applications that interact with them locally on your computer, without relying on Internet connectivity or an Azure subscription. You can develop against the SQL API, as well as the Table, MongoDB, Cassandra, and Gremlin APIs. While the underlying implementation of Cosmos DB functionality is obviously different from that of the Azure-based service, the emulation hides these differences, yielding full compatibility from the development standpoint. However, it is important to note that (not surprisingly) features that depend on access to cloud infrastructure and multiple accounts, such as scaling, replication, performance guarantees, and consistency levels, are not available.

The emulator is supported only on 64-bit versions of Windows (including Windows Server 2012 R2, Windows Server 2016, Windows Server 2019, and Windows 10) and requires at minimum 2 GB of RAM and 10 GB of free disk space. It is available as a Microsoft Installer package from the Microsoft Download Center or as a Docker for Windows container from the Microsoft Container Registry. When using the Microsoft Installer package, you must perform the installation in the security context of an account with local administrative privileges.

Once the installation completes, you can launch the emulator from the Start menu (as any other desktop app) or by running the C:\Program Files\Azure Cosmos DB Emulator\CosmosDB.Emulator.exe command-line utility. At that point, you will be able to access the emulator user interface in a browser window via the https://localhost:8081/_explorer/index.html URL (note that the Data Explorer part of that interface is available only when using the SQL API). You also have the ability to manage the emulator by using PowerShell, leveraging the Microsoft.Azure.CosmosDB.Emulator module included in the installation. The module supports basic administrative tasks, such as starting, stopping, or uninstalling the emulator.

In order to connect to the underlying service, you need to provide the master key. The emulator relies on a well-known key value (C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw==), but you have the option of modifying it by running CosmosDB.Emulator.exe with the /GenKeyFile option. While connectivity is, by default, limited to the local computer (the emulator listens for incoming requests on TCP port 8081), it is possible to allow inbound traffic from the local network by specifying the /AllowNetworkAccess option when starting the emulator.
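
As a minimal sketch of connecting to a locally running emulator, the example below uses the azure-cosmos Python SDK with the well-known endpoint and key quoted above. The database, container and item names are placeholders, and disabling TLS verification is shown only as a local-development convenience for runtimes that have not yet imported the emulator’s certificate, as discussed in the next paragraph.

```python
# Sketch: talking to the local Cosmos DB emulator with the azure-cosmos SDK.
# Database, container and item names below are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

ENDPOINT = "https://localhost:8081/"
# Well-known emulator master key (quoted in the article above).
KEY = ("C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqy"
       "MsEcaGQy67XIw/Jw==")

client = CosmosClient(
    ENDPOINT,
    credential=KEY,
    # Local-dev convenience if the emulator certificate has not been imported
    # into Python's trust store (see the certificate discussion below).
    connection_verify=False,
)

db = client.create_database_if_not_exists("demo-db")
container = db.create_container_if_not_exists(
    id="items", partition_key=PartitionKey(path="/category")
)

container.upsert_item({"id": "1", "category": "bicycles", "name": "Road bike"})

for item in container.query_items(
    query="SELECT c.id, c.name FROM c WHERE c.category = 'bicycles'",
    enable_cross_partition_query=True,
):
    print(item)
```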

Depending on your choice of programming platform, it might be necessary to carry out post-installation certificate management tasks. This is because the initial setup of the emulator generates a certificate that is subsequently used to facilitate SSL-based communication with the Cosmos DB account. This certificate is stored in the Local Machine private certificate store. While that store is readily accessible to .NET-based code, it does not integrate with Python or Java (which rely on their own certificate stores). To address this incompatibility, you will need to export the certificate and then import it into the store corresponding to your preferred runtime.

With the emulator installed and running, you are ready to start developing Cosmos DB-bound applications. Specifics depend on the API you chose when creating the target Cosmos DB account. For example, for the SQL API, you can use either the Azure Cosmos DB SDK (available from the NuGet repository) or the Azure Cosmos DB REST API (documented in the Azure Cosmos DB: REST API Reference). For the Table API, you can use the Microsoft Azure Cosmos DB Table Library for .NET (also available from the NuGet repository). When working with MongoDB, you can leverage the MongoDB .NET driver, downloadable from the MongoDB Ecosystem. To work with the Cassandra API, install Python and the Cassandra CLI/CQLSH. For Gremlin API development, install apache-tinkerpop-gremlin-console-3.3.4.

The Cosmos DB emulator facilitates data transfers by utilizing the functionality of the Azure Cosmos DB Data Migration Tool. This open source tool, available from GitHub, supports a range of data sources, including Azure tables, MongoDB databases, SQL Server databases, CSV and JSON files, and Amazon DynamoDB. It also allows you to transfer data to and from Azure Cosmos DB.

In upcoming articles, I will step through developing sample Cosmos DB applications by using the emulator.

See All Articles by Marcin Policht
