Channel: NoSQL – Cloud Data Architect

Meet the MySQL Team in Austin, TX

Feed: Planet MySQL
Author: Frederic Descamps

At the end of the month, some engineers of the MySQL Team will be present in Austin, TX!

We will attend the first edition of Percona Live USA in Texas.

During that show, you will have the chance to meet key engineers, product managers, as well as Dave and myself.

Let me introduce the team that will be present during the conference:

The week will start with the MySQL InnoDB Cluster full-day tutorial by Kenny and myself. This is a fully hands-on tutorial where we will start by migrating a classical asynchronous master-replicas topology to a new MySQL InnoDB Cluster. We will then work through several labs where we will see how to maintain our cluster.

If you registered for our tutorial, please come with a laptop able to run 3 VirtualBox VMs, which you can install from a USB stick. Please free up some disk space and install the latest VirtualBox on your system.

This year, I will also have the honor of presenting the State of the Dolphin during the keynote.

During the conference, you will be able to learn a lot from our team on many different topics. Here is the list of sessions by our engineers:

We will also be present in the expo hall, where we will welcome you at our booth. We will show you demos of MySQL InnoDB Cluster and MySQL 8.0 Document Store, where NoSQL and SQL live together in peace! Don’t hesitate to visit us during the show.

We will also be present during the Community Dinner and will enjoy hearing your thoughts about MySQL!

See you in almost 2 weeks in Texas!


Amazon EC2 I3en instances, offering up to 60 TB of NVMe SSD instance storage, are now generally available

Feed: Recent Announcements.

Today, we are announcing the general availability of storage-optimized Amazon EC2 I3en instances, the largest Non-Volatile Memory Express (NVMe) based SSD storage instance in the cloud. I3en instances offer up to 60 TB of low latency NVMe SSD instance storage and up to 50% lower cost per GB over I3 instances. These instances are designed for data-intensive workloads such as relational and NoSQL databases, distributed file systems, search engines, and data warehouses that require high random I/O access to large amounts of data residing on instance storage. I3en instances also provide up to 100 Gbps of networking bandwidth, up to 96 vCPUs, and up to 768 GiB of memory. In addition, customers can enable Elastic Fabric Adapter (EFA) on I3en for low and consistent network latency. I3en instances are powered by AWS-custom Intel® Xeon® Scalable (Skylake) processors with 3.1 GHz sustained all core turbo performance.

I3en instances come in seven instance sizes, with storage options from 1.25 to 60 TB. I3en instances deliver up to 2 million random IOPS at 4 KB block sizes and up to 16 GB/s of total disk throughput at 128 KB block sizes. These instances are available today in US East (N. Virginia), US West (Oregon), and Europe (Ireland) AWS regions.

Amazon EC2 I3en instances are offered as On-Demand, Reserved, or Spot Instances. For pricing, visit the EC2 pricing page. For details and to get started visit the Amazon EC2 I3en page, AWS Management Console, AWS Command Line Interface (CLI), or AWS SDKs.

The Various Flavors of Two-Phase Commits — Explained (Tech Blog)

Author: mkysel.

Internally, NuoDB uses the two-phase commit protocol to manage durability of user data. NuoDB also supports the X/Open XA protocol for synchronizing global transactions across multiple data stores. XA is also sometimes referred to as two-phase commit. The fundamental principles of both protocols are similar, but they serve different purposes. Let us explore the differences between these two protocols.

A Single Transaction Across Two Resources

Let us explore a simple use case. A simple application takes messages from one data source (outgoing_messages) and writes them to a new data source (incoming_messages). This is a fairly common use case if you read messages from a message queue such as Apache ActiveMQ and you write them to a database (such as NuoDB). The following code snippet shows this example in the corresponding SQL form.

SQL> select * from outgoing_messages;

ID     MSG
--- ---------
 1 message 1

SQL> start transaction;
SQL> select * from outgoing_messages;

ID     MSG
--- ---------
 1 message 1

SQL> delete from outgoing_messages where id=1;
SQL> insert into incoming_messages(id, msg) values(1, 'message 1');
SQL> commit;

When executed in a single relational database, the two statements (insert and delete) are expected to behave according to the ACID guarantees. They either both succeed or they both fail. Throughout this article we focus on the A(tomic) guarantee of ACID.

XA abstracts away the statements and transaction lifetime for scenarios where the tables live in different data stores. The following example is a simplified version of an ActiveMQ consumer that receives a message and writes it to NuoDB. Due to space constraints in this article, the code does not contain any setup or failure handling.

javax.jms.MessageConsumer consumer = xasession.createConsumer(queue);

MessageListener listener = new MessageListener() {
    @Override
    public void onMessage(Message msg) {
        TextMessage msg1 = (TextMessage) msg;

        // NuoDB XA data source and connection.
        NuoXADataSource nuodbDs = new NuoXADataSource();
        XAConnection nuodbXAconn = nuodbDs.getXAConnection(DBA_USER, DBA_PASSWORD);

        // Both resources are enlisted in the same global transaction (xid).
        // Creating the xid, enlisting the resources with start(xid, ...),
        // creating nuodbStmt, and all error handling are omitted for brevity.
        XAResource mqRes = xasession.getXAResource();
        XAResource nuoRes = nuodbXAconn.getXAResource();

        nuodbStmt.executeUpdate(String.format(
                "insert into incoming_messages(id, msg) values(1, '%s')", msg1.getText()));

        // End the work associated with the xid on both resources.
        mqRes.end(xid, XAResource.TMSUCCESS);
        nuoRes.end(xid, XAResource.TMSUCCESS);

        // Phase 1: prepare (pre-commit) on both resources.
        mqRes.prepare(xid);
        nuoRes.prepare(xid);

        // Phase 2: commit on both resources (false = two-phase commit).
        mqRes.commit(xid, false);
        nuoRes.commit(xid, false);
    }
};

Internal Two-phase Commit in NuoDB

When you commit a transaction in NuoDB, you have to wait for a period of time equal to a network round trip over the slowest link between the Transaction Engine (TE) that executed the transaction and the slowest Storage Manager (SM). For a refresher on the NuoDB architecture, please read this article. The default commit protocol in NuoDB is SAFE, so every transaction needs to be confirmed by all SMs. It is also possible to change the commit protocol to a weaker guarantee that waits for only a subset of SMs, effectively trading durability guarantees for lower latency.

The commit protocol contains three messages. First, the TE informs all the SMs about the commit intent (a pre-commit). Then the TE waits for a confirmation (commit ack) from all the SMs. The third message is broadcast to all engines in the cluster informing them that the commit was successful. Since the transaction does not wait for the third message to be acknowledged, the user never has to wait for it and the commit protocol returns control to the user application (commit succeeds). Once the commit message reaches remote transaction engines, future transactions started on those engines will see all effects of the committed transaction.

SAFE commit guarantees that a transaction will be durable even in the face of catastrophic engine failures. The Reliable Broadcast protocol guarantees that a transaction either gets committed everywhere or nowhere. The third message is not necessary for durability as NuoDB can deduce the correct state of a transaction during recovery from failure. As long as at least one SM survives a catastrophic event, the transaction can be recovered.

The pre-commit message contains the order in which the transaction is going to be committed in the version vector of that engine. If you are unsure what version vectors are and why NuoDB is using them, don’t hesitate to ask in the comment section (or wait for a dedicated blog post on that topic). The important bit is that the order has already been defined and is going to be the same on all engines. All transaction engines will see all commits from the same transaction engine in the same order.

If the transaction engine fails with a catastrophic failure, the cluster reconciles any ongoing transactions and finishes the commit protocol automatically. This means that if a transaction engine fails before the commit message is broadcast, a replacement message will be generated and the transaction will be made visible. This is an important contrast to XA that we further explore in the following sections.

To summarize: the first two messages are used for durability. The last message is used to make the change visible across the cluster.
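
To make the shape of this flow concrete, here is a minimal sketch in Java. It is purely illustrative: the SmChannel and EngineBroadcast interfaces and the SafeCommitSketch class are hypothetical stand-ins for NuoDB internals, not actual NuoDB APIs, and failure handling and recovery are omitted.

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins for the channels a TE would use; not NuoDB APIs.
interface SmChannel {
    // Sends the pre-commit (including the commit position in the version vector)
    // and completes when the SM's commit ack arrives.
    CompletableFuture<Void> preCommit(long txnId, long versionVectorPosition);
}

interface EngineBroadcast {
    // Fire-and-forget commit message to every engine in the cluster.
    void commit(long txnId);
}

final class SafeCommitSketch {
    private final List<SmChannel> storageManagers;
    private final EngineBroadcast allEngines;

    SafeCommitSketch(List<SmChannel> storageManagers, EngineBroadcast allEngines) {
        this.storageManagers = storageManagers;
        this.allEngines = allEngines;
    }

    void commit(long txnId, long versionVectorPosition) {
        // Messages 1 and 2: pre-commit to every SM, then wait for every ack (SAFE commit).
        CompletableFuture<?>[] acks = storageManagers.stream()
                .map(sm -> sm.preCommit(txnId, versionVectorPosition))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(acks).join(); // durability point: control returns to the user here

        // Message 3: broadcast the commit so the other engines make the change visible.
        // The user does not wait for this message to be acknowledged.
        allEngines.commit(txnId);
    }
}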

XA Two-Phase Commit Explained

XA uses the same three messages to commit a transaction across multiple data stores — in our case a message queue and NuoDB. First it sends a pre-commit, which informs the data store about the intent of a commit. The data store answers with a commit-ack and guarantees that it will always be able to commit the transaction in the future. That means, any subsequent conflicting transactions have to wait until the XA transaction gets resolved before they can be executed. Once all data stores answer with a commit-acknowledgement, the transaction can be globally committed. If any of the data stores fails to commit-ack, all data stores are instructed to abandon the pre-committed transaction.
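
To see where that all-or-nothing decision lives in code, below is a minimal sketch of the voting logic a transaction manager runs over the standard javax.transaction.xa.XAResource interface. It only illustrates the protocol described above; the participant list and xid are assumed to come from the application, decision logging and recovery are omitted, and this is not NuoDB's or any particular transaction manager's implementation.

import java.util.ArrayList;
import java.util.List;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

final class TwoPhaseDecisionSketch {

    // Phase 1: ask every participant to prepare (pre-commit); commit only if all of them vote OK.
    static void complete(List<XAResource> participants, Xid xid) throws XAException {
        List<XAResource> toFinish = new ArrayList<>();
        boolean abort = false;

        for (XAResource res : participants) {
            try {
                int vote = res.prepare(xid);           // XA_OK or XA_RDONLY
                if (vote == XAResource.XA_OK) {
                    toFinish.add(res);                 // still needs a phase-2 outcome
                }                                      // XA_RDONLY: nothing left to commit
            } catch (XAException e) {
                abort = true;                          // any failure to prepare is a "no" vote
            }
        }

        // Phase 2: the same outcome is delivered to every remaining participant.
        for (XAResource res : toFinish) {
            if (abort) {
                res.rollback(xid);
            } else {
                res.commit(xid, false);                // false = full two-phase commit
            }
        }
    }
}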

Once all data stores acknowledge the pre-commit, the commit message is broadcast to all data stores. Of course this message arrives at different times. Due to unpredictable network latencies and process lifetimes, the message might arrive at arbitrary times. It is also possible that one of the data stores crashed and the transaction is only partially committed in the other stores. Once the crashed data store comes back up, the transaction manager needs to reconcile the state. This means that the XA transaction manager needs to persist state for a potentially long period of time.

How good is the promise of a commit ack in XA? A data store is not allowed to retract its guarantee to commit the change after it has acknowledged the pre-commit. In modern applications, depending on such a guarantee proves to have its limitations. Disks have faults, cables get cut, and electricity outages might last a long time. While most data stores can internally guarantee that a pre-committed transaction will always be able to commit, no data store can guarantee that its state can never be lost due to arbitrary external factors. And as such, XA is hard to depend on.

In case of catastrophic failure, when the XA pre-commit is lost and the XA commit can never be completed, undoing a committed transaction from other XA participants can be extremely hard. A torn transaction is a transaction that did not become visible atomically. In other words, only some part of the transaction (such as a single statement) successfully mutated the state of the data store, while others did not. In the case of a catastrophic failure, XA violates A(tomicity) guarantees.

While most applications should be developed with failure in mind, we understand that torn transactions are not the norm. Let us therefore explore the normal case without any failure. Even without failure, XA is never globally Atomic. Let’s return to our original example. The application reads the message from a message queue and writes it to NuoDB. Depending on how the participants receive and process network packets, the message might exist in both, neither or either of the two involved participants at any given time. To make the argument less abstract, let’s imagine another application that reads the state from both data stores at the same time. If XA were globally ACID, there would only be two possible outcomes: the message is in the message queue, but not in NuoDB; or the message is not in the queue but is in NuoDB.

XA transactions are Isolated (the state transition is not observed until the final commit message within one data store), Consistent (same as other transactions, the state transition needs to lead to a valid state within a data store) and Durable (once committed, the state does not get lost unless catastrophic failure happens). XA transactions are not Atomic.

Single Transaction Across Multiple Machines

Let us return to the internal NuoDB commit protocol. Since NuoDB has the ability to recover from failure and the Reliable Broadcast protocol guarantees that all messages make it to all engines, the user is never required to undo a partially committed transaction manually. All transactions are ACID in NuoDB. There is no need for a Transaction Manager that would have to persist state.*

A transaction in NuoDB can never be torn. A single transaction always executes on the same transaction engine. Even if a transaction modifies multiple resources (tables), those modifications are part of the same transaction. This is an important distinction for readers who are familiar with two-phase commits in NoSQL systems.

Summary

As you can see, NuoDB separates the distributed state transition from the durability guarantees as part of the two-phase commit. The first two messages (pre-commit and pre-commit-ack) guarantee that the commit protocol has been satisfied and that a transaction can be recovered after failure. The third message (commit) makes the change atomically visible across the cluster.

While similar to XA, the internal two-phase protocol used by NuoDB does not suffer from the same limitations.

NuoDB has supported X/Open XA transactions since version 3.0. When using XA, be aware of the inherent XA global atomicity violations.

Ready to try for yourself? Download NuoDB Community Edition now. 

*A transaction manager in XA is a user application that synchronizes the data stores.

Why Open Source Software Is the Key to Hybrid Cloud

Feed: Blog Post – Corporate – DataStax.
Author: Louise Westoby, Senior Director, Product Marketing.

In previous posts, we’ve talked about the benefits of the hybrid cloud. We’ve also offered some guidance about making the transition to the hybrid cloud. So you know about the “whys” and “hows” of moving to the hybrid cloud.

But those are really just two parts of a trifecta. The third part, and the topic of this post, focuses on one of the main keys to success with hybrid cloud: open source software.

The age of open source has arrived

As a recent article in Linux Journal observed, attitudes about open source software have significantly evolved over the last 30 years. Once regarded quite dubiously by mainstream enterprises, open source is now freely accepted and utilized by many of the world’s most successful organizations, including Netflix, Apple, and the United States government. Some of the leading tech companies that were early enemies of open source are now among its most strident and impactful supporters.

Back in 2009, the open source software market was forecast to top $8 billion within a few years. But the market blew past $11 billion in 2017, and the current MarketsandMarkets™ forecast predicts that the open source software services market will hit nearly $33 billion by 2022. That would represent a compound annual growth rate of almost 24%.

So the age of open source has truly arrived. Case in point: Google’s recently announced partnership with many open source service providers, including DataStax. But why is open source so essential for hybrid cloud success?

Enabling hybrid cloud success with open source software

Consider the definition of hybrid cloud: A computing environment that incorporates infrastructure from multiple platforms and data centers.

That single, simple sentence encapsulates the power of the hybrid cloud. Because it enables organizations to utilize the optimal set of resources for their application workloads, hybrid cloud offers businesses and public sector agencies the ability to maximize the benefits of the cloud while minimizing costs.

How DataStax leverages open source

DataStax offers the only active everywhere database for hybrid cloud. It is, quite simply, the best available distributed database for hybrid cloud environments. But without open source, the DataStax solution wouldn’t be possible. That’s because DataStax utilizes the power of Apache Cassandra™, an open source, distributed, NoSQL database.

Cassandra provides five key benefits that make hybrid cloud and DataStax Enterprise (DSE) so powerful:

  • Scalability: Cassandra has the ability to both scale up (adding capacity to a single machine) and scale out (adding more servers) to tens of thousands of nodes.
  • High availability: Cassandra’s masterless architecture enables the quick replication of data across multiple data centers and geographies. This feature powers the always-on benefit offered by DataStax.
  • High fault tolerance: Cassandra provides automated fault tolerance. Cassandra’s masterless, peer-to-peer architecture and data replication capabilities support a level of system redundancy that ensures full application speed and availability, even when nodes go offline. And this capability is fully automated—no manual intervention is required.
  • High performance: In today’s business climate, speed is essential. Cassandra’s architecture minimizes the high latency and bottlenecks that so often stifle productivity and frustrate both internal users and customers. This is further enhanced in DSE, which offers more than 2x the performance of standalone open source Cassandra.
  • Multi-data center and hybrid cloud support: Designed as a distributed system, Cassandra enables the deployment of large numbers of nodes across multiple data centers. It also supports cluster configurations optimal for geographic distributions, providing redundancy for failover and disaster recovery.
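
As one concrete illustration of that multi-data center support, a Cassandra keyspace can be told how many replicas to keep in each data center directly in CQL. This is a hedged sketch of my own, not from the original post, and the keyspace and data center names are made up:

CREATE KEYSPACE orders
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_east': 3,
    'eu_west': 3
  };

Each data center then holds three replicas of every row, which is what enables the failover and disaster recovery described above.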

Simply put, the very best database solution for the hybrid cloud just wouldn’t exist without open source software. And that’s just one example among countless others that serves to illustrate how essential open source is in enabling the best benefits of the hybrid cloud.

Remember: there’s no such thing as free open source software

As everyone likely knows, the acronym “FOSS” stands for free, open source software. But putting open source software to work for your organization’s hybrid cloud initiative will not be free. As this ComputerWeekly article observed, “Open source is not ‘free’ because every deployment has a cost associated with it, but what it does allow for is choice, and an organization can go down one of two routes in adopting it.”

Those two routes, or choices, in implementing open source are:

  • Install, manage, and support the open source software yourself and incur the training and learning-curve costs
  • Pay a professional open source services provider to implement and support the open source solution

And sometimes, a combo of the two approaches might be best.

But even though FOSS isn’t free, it can provide you with a wonderful range of lower-cost options for transitioning to hybrid cloud. You can at least get started with free training on DataStax Academy, and you can also download the DataStax Distribution of Apache Cassandra™ for free.


Azure Marketplace new offers – Volume 36

Feed: Microsoft Azure Blog.
Author: Christine Alford.

Bluefish Editor on Windows Server 2016

Bluefish Editor on Windows Server 2016: Apps4Rent helps you deploy Bluefish Editor on Azure. Bluefish, a free software editor with advanced tools for building dynamic websites, is targeted as a middle path between simple editors and fully integrated development environments.

BOSH Stemcell for Windows Server 2019

BOSH Stemcell for Windows Server 2019: This offer from Pivotal Software provides Windows Server 2019-based Stemcell for the Pivotal Cloud Foundry platform.

Corda Opensource VM

Corda Opensource VM: R3’s Corda is an open-source blockchain platform that removes costly friction in business transactions by enabling institutions to transact directly using smart contracts and ensures privacy and security.

DataStax Distribution of Apache Cassandra

DataStax Distribution of Apache Cassandra: DataStax offers a simple, cost-effective way to run the Apache Cassandra database in the cloud. DDAC addresses common challenges with adoption, maintenance, and support by streamlining operations and controlling costs.

DataStax Enterprise

DataStax Enterprise: DataStax delivers the always-on, active-everywhere, distributed hybrid cloud NoSQL database built on Apache Cassandra. DataStax Enterprise (DSE) makes it easy for enterprises to exploit hybrid and multi-cloud environments via a seamless data layer.

FatPipe WAN Optimization for Azure

FatPipe WAN Optimization for Azure: Significantly boost wide area network performance with FatPipe WAN optimization, which appreciably increases utilization and makes effective use of bandwidth through caching and compression that sharply reduce redundant data.

Flexbby One RU Edition

Flexbby One RU Edition: Get a comprehensive solution for complex workflow automation in sales, marketing, service, HR, and legal. Flexbby One is powerful software to help you manage the contract lifecycle, document archiving, procurement, customer service, and more.

Flowmon Collector for Azure

Flowmon Collector for Azure: Flowmon Collector serves for collection, storage, and analysis of flow data (NetFlow, IPFIX). Flowmon is a comprehensive platform that includes everything you need to get absolute control over your network through network visibility.

Innofactor QualityFirst

Innofactor QualityFirst: Get QualityFirst by Innofactor for healthcare, patient, and care instructions.

Keycloak Gatekeeper Container Image

Keycloak Gatekeeper Container Image: Keycloak Gatekeeper is an adapter that integrates with Keycloak authentication supporting access tokens in browser cookie or bearer tokens. This Bitnami Container Image is secure, up-to-date, and packaged using industry best practices.

MIKE Zero

MIKE Zero: This MIKE modeling suite from DHI A/S helps engineers and scientists who want to model water environments, and includes most of MIKE Powered by DHI’s inland and marine software.

System Integrity Management Platform (SIMP) 6.3

System Integrity Management Platform (SIMP) 6.3: SIMP is an open-source framework that can either enhance your existing infrastructure or allow you to quickly build one from scratch. Built on the Puppet product suite, SIMP is designed around scalability, flexibility, and compliance.

IBM Streams: A 10-year anniversary, and what’s next

Feed: IBM Big Data & Analytics Hub – All Content
Author: roger-rea

IBM Streams was first available on May 15, 2009, exactly 10 years ago today. Happy anniversary!

From System S to Streams

IBM Streams evolved from a five-year collaboration between IBM Research and the U.S. government. The original goals were to:

  • Create a system to ingest unprecedented volumes of data
  • Analyze data arriving at extremely high velocity
  • Handle a variety of data, both structured and unstructured.

You can glean some Streams history from the IBM Research pages for the System S project. The “S” was short for Streams, to play off the earlier success of Project R from IBM Research – the “R” was for “relational.” The publications tab for System S includes more than 120 technical white papers dating back to 2004. The first one published was “Interval query indexing for efficient stream processing.”

A few early adopters like the University of Ontario Institute of Technology, University of Uppsala and KTH Royal Institute of Technology in Stockholm helped shape the runtime and language. This short video describes their use cases, which include neonatal intensive care unit monitoring, space weather prediction and traffic monitoring. Since then, businesses have used Streams to unlock value throughout the enterprise. One transportation company was documented to have achieved a 150 percent ROI using Streams.

Just before the seventh release of System S in 2008, IBM decided to turn this IBM Research project into an offering. System S version 3.2 included a runtime and a programming model that contained 10 operators and the ability to extend them with custom C/C++ or Java programs. Advanced capabilities included back pressure and an optimizing compiler to spread applications across a clustered runtime.

Soon after Streams v1.0 became available, the product’s focus was condensed to reflect volume, velocity and variety.

The first few releases continued use of the early version of Stream Processing Language (SPL) called the Stream Processing Application Declarative Engine. With Version 2.0, a major effort was made to simplify and standardize SPL so all operators would behave consistently to simplify learning and development.

Since the last Research System S release 10 years ago, 15 major releases added new functionality, described in the timeline above. We put a lot of emphasis on developer tools and added a wealth of analytics capabilities.

A few highlights of functions added to Streams over the years:

  • Dozens of analytic capabilities like native machine learning, model scoring, time series analysis, rules, forecasting, and geospatial data analysis.
  • From the original 10 operators, now over 200 operators
  • Visualization of streaming data
  • Visual drag-and-drop development in 2012
  • Java development in 2015 and Python development in 2016
  • Apache Beam development in 2018
  • At least once and exactly once data processing

Thanks to Streams, we made many contributions to the open source community. Scores of additional operators are available on GitHub, including:

  • Connectors for NoSQL Key-Value stores like MongoDB and Redis
  • OpenCV toolkit and operators for video and image analytics.
  • Plugins for VSCode and Atom allow creating SPL programs using popular editors.

Streams v5.0 brings real-time analytics to the private cloud

The latest release, Streams v5.0 for IBM Cloud Private for Data (ICP for Data) provides a real-time engine within our data platform. The platform simplifies bringing artificial intelligence (AI) into your enterprise processes. It can collect, organize, analyze and infuse AI into your business. Streams is ideally suited for taking your AI models and infusing them throughout your company. Watch this webinar to learn more about Streams on the IBM Cloud Private for Data platform.

One customer is running more than 1,700 AI models in Streams and achieving the following:

  • Revenues up 50 Percent with improved click-stream advertising
  • Enhanced chat bot conversations
  • Predict and anticipate customers calling their call centers

By anticipating callers and sending recommendations to solve problems on lower-cost channels, they expect to handle more than one million calls this year through those channels.

Over the past 10 years, IBM Streams has led the industry in the streaming analytics market with some of the most advanced use cases across many industries. As we continue to help companies infuse AI into their business processes with continuous intelligence, IBM aims to help clients drive down costs and increase revenues to improve outcomes.

If you haven’t tried out Streams yet, you’ve missed ten years of opportunity. Isn’t it time to see what real streaming analytics is all about? Visit IBM Streams to learn more and try it out.

Structuring Your Unstructured JSON data

Feed: Planet MySQL
Author: Dave Stokes

The world seems awash in unstructured, NoSQL data, mainly of the JSON variety.  While this has a great many benefits as far as data mutability and not being locked into a rigid structure, there are some things missing that are common in the structured world of SQL databases.

What if there was a way to take this unstructured NoSQL JSON data and cast it, temporarily, into a structured table?  Then you could use all the processing functions and features found in a relational database on your data.  There is a way, and it is the JSON_TABLE function.

JSON_TABLE

You can find the documentation for JSON_TABLE here  but there are some examples below that may make learning this valuable function easier than the simple RTFM.

I will be using the world_x dataset for the next example.

If we run a simple SELECT JSON_PRETTY(doc) FROM countryinfo LIMIT 1; the server will return something similar to the following:

{
  "GNP": 828,
  "_id": "ABW",
  "Name": "Aruba",
  "IndepYear": null,
  "geography": {
    "Region": "Caribbean",
    "Continent": "North America",
    "SurfaceArea": 193
  },
  "government": {
    "HeadOfState": "Beatrix",
    "GovernmentForm": "Nonmetropolitan Territory of The Netherlands"
  },
  "demographics": {
    "Population": 103000,
    "LifeExpectancy": 78.4000015258789
  }
}

We can use JSON_TABLE to extract the Name, the Head of State, and the Governmental Form easily with the following query. If you are not used to the MySQL JSON data type, the "$" references the entire document in the doc column (and doc is our JSON data type column in the table). And notice that $.government.HeadOfState and $.government.GovernmentForm are the full paths to the keys in the document.

select jt.* FROM countryinfo,
json_table(doc, "$" COLUMNS (
          name varchar(20) PATH "$.Name",
          hofstate varchar(20) PATH '$.government.HeadOfState',
          gform varchar(50) PATH '$.government.GovernmentForm')) as jt
limit 1;

The syntax is JSON_TABLE(expr, path COLUMNS (column_list) [AS] alias), where expr is either a column from a table or a JSON document passed to the function ('{"Name": "Dave"}' as an example). Then the desired columns are specified: we name the new column, give it a relational type, and then specify the path of the JSON values we want to cast.


And the results are in the form of a relational table.

+-------+----------+----------------------------------------------+
| name  | hofstate | gform                                        |
+-------+----------+----------------------------------------------+
| Aruba | Beatrix  | Nonmetropolitan Territory of The Netherlands |
+-------+----------+----------------------------------------------+

This is JSON_TABLE in its most basic form. The only thing I would like to emphasize is that the keys of the JSON data are case sensitive, and it is important to check your spelling!
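
As mentioned above, the expression does not have to be a table column. Here is a quick sketch of my own (not from the original post) that passes a literal JSON document straight to the function:

select jt.* FROM JSON_TABLE(
       '{"Name": "Dave"}',
       "$" COLUMNS (name varchar(20) PATH "$.Name")) as jt;

This returns a one-row, one-column table containing 'Dave'.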

Data Problems

There is also a nice feature of JSON_TABLE where you can assign a default value if a key/value pair is missing, or yet another value if something cannot be cast. If we use a non-existent key/value pair named 'xyz' as an example, we can insert the value '888' for any JSON document missing that value.

select jt.* FROM countryinfo,
json_table(doc, "$" COLUMNS (
          name varchar(20) PATH "$.Name",
          hofstate varchar(20) PATH '$.government.HeadOfState',
          xyz int(4) PATH '$.xyz' DEFAULT '999' ON ERROR DEFAULT '888' ON EMPTY,
          gform varchar(50) PATH '$.government.GovernmentForm')) as jt
limit 1;

And how the result looks:

+-------+----------+-----+----------------------------------------------+
| name  | hofstate | xyz | gform                                        |
+-------+----------+-----+----------------------------------------------+
| Aruba | Beatrix  | 888 | Nonmetropolitan Territory of The Netherlands |
+-------+----------+-----+----------------------------------------------+

NULL Handling

Now be careful with NULL values. If you change the new line to xyz int(4) PATH '$.IndepYear' DEFAULT '999' ON ERROR DEFAULT '888' ON EMPTY, we can easily see that the NULL value for Aruba’s year of independence will return the default '999' value. And if you change the path to '$.Name' to try and force the string value into the integer, it will take the ON ERROR path.


And you can also map missing values to NULL instead of a default.
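
For example, here is a variation of the earlier query (my own sketch, not from the original post) that maps both a missing key and a failed cast to NULL instead of a default value:

select jt.* FROM countryinfo,
json_table(doc, "$" COLUMNS (
          name varchar(20) PATH "$.Name",
          xyz int PATH '$.xyz' NULL ON EMPTY NULL ON ERROR)) as jt
limit 1;

Since xyz does not exist in the document, the column simply comes back as NULL for that row.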

Nested Path Data



Iterating over nested arrays can be painful, but JSON_TABLE makes it very simple. So, creating some dummy data, we can start digging through the nested information.


select * from a;
+----+-----------------------+
| id | x                     |
+----+-----------------------+
|  1 | {"a": 1, "b": [1, 2]} |
|  2 | {"a": 2, "b": [3, 4]} |
|  3 | {"a": 3, "b": [5, 6]} |
+----+-----------------------+

The query features the NESTED PATH argument:

select d.* FROM a,
JSON_TABLE(x, "$" columns
        (mya varchar(50) PATH "$.a",
NESTED PATH "$.b[*]"
                columns (myb int path '$'))
) as d;

The output.

+-----+-----+
| mya | myb |
+-----+-----+
| 1   |   1 |
| 1   |   2 |
| 2   |   3 |
| 2   |   4 |
| 3   |   5 |
| 3   |   6 |
+-----+-----+
6 rows in set (0.0013 sec)

Not bad, but let’s add another level.

select * from b;
+----+-----------------------------------------------------+
| id | x                                                   |
+----+-----------------------------------------------------+
|  1 | {"a": 2, "b": [{"c": 101, "d": [44, 55, 66]}]}      |
|  2 | {"a": 1, "b": [{"c": 100, "d": [11, 22, 33]}]}      |
|  3 | {"a": 3, "b": [{"c": 102, "d": [77, 88, 99, 101]}]} |
+----+-----------------------------------------------------+
3 rows in set (0.0009 sec)

So let’s embed another level:

select d.* FROM b,
       JSON_TABLE(x, "$" columns
        (mya varchar(50) PATH "$.a",
          NESTED PATH "$.b[*]"
           columns (myc int path '$.c',
           NESTED PATH '$.d[*]'
           columns (dpath int path '$')))
) as d order by myc;
+-----+-----+-------+
| mya | myc | dpath |
+-----+-----+-------+
| 1   | 100 |    22 |
| 1   | 100 |    33 |
| 1   | 100 |    11 |
| 2   | 101 |    44 |
| 2   | 101 |    55 |
| 2   | 101 |    66 |
| 3   | 102 |    77 |
| 3   | 102 |    88 |
| 3   | 102 |    99 |
| 3   | 102 |   101 |
+-----+-----+-------+
10 rows in set (0.0006 sec)

And we can get ordinal numbers too.

select d.* FROM b,
         JSON_TABLE(x, "$" columns
        (mya varchar(50) PATH "$.a",
           NESTED PATH "$.b[*]"
           columns (myc int path '$.c',
           nested path '$.d[*]'
           columns (dcount for ordinality,
           dpath int path '$'))) ) as d
order by dpath;
+-----+-----+--------+-------+
| mya | myc | dcount | dpath |
+-----+-----+--------+-------+
| 1   | 100 |      1 |    11 |
| 1   | 100 |      2 |    22 |
| 1   | 100 |      3 |    33 |
| 2   | 101 |      1 |    44 |
| 2   | 101 |      2 |    55 |
| 2   | 101 |      3 |    66 |
| 3   | 102 |      1 |    77 |
| 3   | 102 |      2 |    88 |
| 3   | 102 |      3 |    99 |
| 3   | 102 |      4 |   101 |
+-----+-----+--------+-------+
10 rows in set (0.0009 sec)

And now that we have the data structured, we can start using a WHERE clause, such as where myc > 100 and dpath < 100.
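
For instance, here is a sketch of that filter applied to the previous nested query (my own illustration, not from the original post):

select d.* FROM b,
       JSON_TABLE(x, "$" columns
        (mya varchar(50) PATH "$.a",
          NESTED PATH "$.b[*]"
           columns (myc int path '$.c',
           NESTED PATH '$.d[*]'
           columns (dpath int path '$')))
) as d
where myc > 100 and dpath < 100
order by myc, dpath;

This keeps only the rows where myc is 101 or 102 and dpath is below 100.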





Where you can find MySQL in June-July 2019

Feed: Planet MySQL
Author: Oracle MySQL Group

As a follow-up to our previous announcement from March 14, please find below a list of shows & conferences where you can find the MySQL team:

  • June 2019:
    • OpenSource Conference, Hokkaido, Japan, May 31-June 1, 2019
      • As we already announced, MySQL is a Gold sponsor of this OS show. Our local MySQL team is going to staff the MySQL booth in the expo area, as well as give MySQL talks as follows:
        • “MySQL Update + MySQL InnoDB Cluster” given by Daisuke Inagaki, the MySQL Principal Solutions Engineer. His talk will be held on May 31 @13:55.
        • “MySQL – latest information & peripheral information by Kei Sakai & Yoku0825”: this talk is given on behalf of MyNA (MySQL Nippon Association) by Kei Sakai & Yoku0825, the Vice Presidents of MyNA. The talk is scheduled for June 1 @11:00.
    • DOAG – Databanken, Dusseldorf, Germany, June 3-4, 2019
      • MySQL Community Team is Gold sponsor of this Oracle User Community event in June. We are having a booth in the expo area as well as approved MySQL talk on “MySQL Cloud/MySQL Analytic” given by Carsten Thalheimer, the MySQL Master Principal Sales Consultant.
    • DevTalks Romania, Bucharest, Romania, June 6-7, 2019
      • MySQL Community Team is a Silver sponsor of this show, with a MySQL talk & booth in the expo area. This year we are going to share the booth with the Oracle Bronto (NetSuite Development) team.
      • You can find MySQL talk in the agenda as follows:
        • “What’s New in MySQL 8.0 Security” given by Georgi Kodinov, the Team Lead, MySQL Server General Team. His talk is scheduled for June 6 @15:40-16:20.
      • We are looking forward to seeing & talking to you at DevTalks this year!
    • BGOUG, Pravets, Bulgaria, June 7-9, 2019
      • As is tradition, the MySQL Community Team & Oracle are partners of this Bulgarian Oracle User Community event in Pravets. This year the Sr. MySQL Sales Consultant, Vittorio Cioe, will be talking about “MySQL InnoDB Cluster: High availability with no stress!”.
    • Hong Kong Open Source Conference, Hong Kong, June 14-15, 2019
      • Again this year we are part of the biggest OS show in Hong Kong. We are a Silver sponsor with a MySQL booth in the expo area. You can also find three MySQL-related talks & one talk from Oracle GraalVM in the morning MySQL track on the second day.
      • We are looking forward to talking to you at HKOS this year!
    • SouthEast Linux Fest (SELF), Charlotte, US, June 14-16, 2019
      • We are back again at SELF as a Diamond sponsor this year. We are going to have a MySQL booth in the expo area as well as MySQL talk(s), hopefully to be approved. We have submitted several proposals, so please watch the organizers’ website for schedule updates to see when & where our talks will be held.
      • We are looking forward to talking to you @SELF 2019!
    • OpenExpo Europe, Madrid, Spain, June 20, 2019
      • The MySQL Community Team, in cooperation with the Oracle Linux team, is a Gold sponsor of OpenExpo Europe this year. You can find our staff at the MySQL/Linux booth in the expo area, as well as MySQL & Linux talks in the schedule.
      • For MySQL the talk will be given by Keith Hollman, the MySQL Principal Sales Consultant. Keith’s talk on “MySQL 8.0: Highly Available, JSON, NoSQL & Document Store straight out-of-the-box!” is scheduled for Jun 20 @16:40-17:10.
    • DataOps Barcelona, Barcelona, Spain, June 20-21, 2019
      • MySQL Community Team is a Community sponsor of DataOps Barcelona, with a talk on “Deep dive into MySQL Group Replication: the magic explained” given by Frederic Descamps, the MySQL Community Manager. His talk is scheduled for June 20 @11:30 in Room A-4.
    • RootConf 2019, Bangalore, India, June 21-22, 2019
      • The MySQL Community team, together with the local MySQL team, is a Bronze partner of RootConf this year. We are going to have a MySQL booth in the expo area. We should also have an approved MySQL talk, which is not yet in the schedule. Please watch the organizers’ website for further updates.
  • July 2019:
    • OpenSource Conference, Nagoya, Japan, July 13, 2019
      • The MySQL Community team, in cooperation with the local MySQL team, is a Gold sponsor of this OS show in Nagoya. You can find us at the MySQL booth in the expo area, as well as a MySQL talk in the schedule as follows:
        • “MySQL Update” by Machiko Ikoma, the MySQL Principal Solution Engineer.
      • We look forward to talking to you at OSC Nagoya!
    • FOSS4G Hokkaido, Japan, July 13-14, 2019
      • MySQL is a partner of the OSGeo.JP group, which is organizing the FOSS4G event. At this moment we are working on the seminar topic as well as on other benefits MySQL will get. Please stay tuned for further updates.

DataStax Constellation: A Quick Technical Rundown

Feed: Blog Post – Corporate – DataStax.
Author: Robin Schumacher, SVP and Chief Product Officer.

We’re excited to announce DataStax Constellation, our new cloud platform that’s designed to bring a whole new level of sophistication and capabilities to the Database as a Service (DBaaS) market in a very cost-effective manner.

Constellation is a major addition to our already available powerful set of cloud capabilities and solutions. And while it’s true that other DBaaS offerings have been on the market for a while now, Constellation makes them look like a Blackberry to our iPhone.

What Is Constellation?

Constellation is a cloud platform that delivers a number of services designed to make building modern database applications in the cloud quick, simple, and cost effective. The first two services in Constellation are DataStax Apache Cassandra as a Service and DataStax Insights.

Our three goals with the initial release of Constellation are to (1) simplify Cassandra deployments and management, (2) accelerate cloud application development with auto-configured developer tooling, and (3) provide fast and smart problem identification/resolution with our next-generation performance optimization solution.

As to cloud vendor support, Constellation will first support Google Cloud Platform, then Amazon Web Services, followed by Microsoft Azure.

Why DataStax in the Cloud

There’s a reason why over 60% of our customers run our software in the cloud today, with two-thirds saying they’ll be in the cloud with us in the next 6 to 12 months. It’s the same reason we get gasps from attendees at our Developer Day events when we show off our multi-cloud demo.

No other database can do what we do in the cloud.

This is because our database platform is built on the proven open source foundation of Apache Cassandra™ and paired with advanced functionality that modern enterprises need. The powerful combination of Cassandra’s unique masterless architecture and DataStax’s enterprise feature set delivers the speed, constant uptime, transparent data distribution, security, and scale that brings to life everything a cloud database is meant to be.

With Constellation, this fact becomes even more “easy and obvious” to understand.

DataStax Apache Cassandra as a Service

The cornerstone service of Constellation is our own DataStax Apache Cassandra as a Service, which leapfrogs all currently existing as-a-service Cassandra offerings on the market. Whether it’s processing ecommerce transactions, performing product searches, or ingesting IoT data from billions of devices, no other DBaaS will do it faster than Constellation’s DataStax Apache Cassandra as a Service.

Further, no Cassandra as a Service is more secure than what we have in Constellation. Whether it’s advanced authentication and authorization, object permission management, data encryption (on disk and network), role-based access control, data auditing, row-level access control, enterprise-secured drivers, etc., all the bases are covered when it comes to data protection.

Constellation’s DataStax Apache Cassandra as a Service is also the most sophisticated when it comes to auto-management and self-healing functionality. Smart backups, transparent repair operations, automated traffic governance, and more all exist in DataStax Apache Cassandra as a Service.

Accelerated Cloud Development

Developers will love how Constellation’s DataStax Apache Cassandra as a Service makes application development fast and easy. From within Constellation’s cloud console, a developer can, with one click, be taken to a web-hosted version of DataStax Studio and start coding right away.

For those wanting to do traditional development on their laptops, a single click in the web console will download a personalized set of cloud developer tools and utilities that are automatically secured and configured to allow for bulk loading, query management, and application development with their favorite development languages.

Performance Insights

In addition to DataStax Apache Cassandra as a Service we’re pleased to announce DataStax Insights, which is our next-generation performance management and optimization solution. Our own support team will begin using Insights to assist Constellation DataStax Apache Cassandra as a Service customers, and shortly thereafter, Insights will be available for customers to utilize for their DataStax Apache Cassandra as a Service clusters and then on-premises DataStax deployments. This is especially key for customers who want a single view of their hybrid cloud deployments.

Three things make Insights stand out from among existing DBaaS performance tools. First is its coverage for both cloud and on-premises deployments.

Next is its productivity-increasing mechanisms for communicating where busy admins and operators need to spend their time. For example, Insights’ Health Index conveys, with a single measurement, the overall well-being of every cluster being monitored. Administrators—whether novices or experts—immediately know which clusters require their attention and which are already optimized.

Third, Insights uses AI-powered capabilities to learn what’s normal and abnormal for each particular cluster and customizes recommendations for improvements based on that learning.

What about Cost?

All this functionality is awesome, but we knew that we needed to price Constellation services properly to be competitive. The great thing is, our DataStax Apache Cassandra as a Service is so much more resource-efficient than other DBaaS offerings that oftentimes fewer resources are required for the same workload.

To test this claim, we benchmarked DataStax Apache Cassandra as a Service against a top NoSQL DBaaS with an IoT workload. The end result was that DataStax Apache Cassandra as a Service cost 38% less for read workloads and 84% less for write-intensive workloads. Not bad!

Give Constellation a Try

With the introduction of Constellation, we now have all the bases covered when it comes to deployment flexibility, both on-premises and in the cloud. Whether you’re a do-it-yourselfer in the cloud, want to use DataStax software through a cloud marketplace, are looking for a completely managed service in the cloud, or need to deploy a TCO-friendly, superhero-styled DBaaS, DataStax has what you need.

We have more information on Constellation waiting for you on our website, so be sure to check out everything we have available now.


The Demise of Big Data


Feed: Databasejournal.com – Feature Database Articles.
Author: .

Big data as an application (or as a service) is being supplanted by artificial intelligence (AI) and machine learning. Few new requirements for a big data solution have arisen in the past few years. All the low-hanging fruit (fraud detection, customer preferences, just-in-time re-stocking and delivery, etc.) have already been big data-ized. Is this the end of big data?

Big Data Evolution

The first big data applications were for forecasting based on historical data. IT extracted operational data on a regular basis (usually nightly) to store within a stand-alone big data solution. Product and service sales transactions could then be parsed and aggregated by time, geography and other categories looking for correlations and trends. Analysts then used these trends to make price, sales, marketing and shipping changes that increased profits.
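
As a purely illustrative sketch of that nightly extract-and-aggregate pattern, the step might look like the Python below. The file name, column names, and pandas usage are assumptions for illustration, not anything described in this article:

```python
# Hypothetical nightly aggregation of an extracted sales file by month, region, and product.
import pandas as pd

sales = pd.read_csv("nightly_sales_extract.csv", parse_dates=["sale_date"])
trend = (
    sales.groupby([pd.Grouper(key="sale_date", freq="M"), "region", "product"])
         ["revenue"]
         .sum()
         .reset_index()
)
print(trend.head())  # monthly revenue per region and product, ready for trend analysis
```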

Data tended to be well-structured and well-understood. Most data items were either text or numeric, such as product names, quantities and prices. These data types were extracted from operational databases and used in various parts of the enterprise such as data marts and the data warehouse. Analysts understood these data types and how to access and analyze them using SQL.

The next step was the transformation of big data solutions into services that could be invoked in real time by operational applications. One common example is on-line product sales, where applications can predict and detect possible financial fraud, suggest customer preferences and issue warehouse re-stocking commands based on what customers order. A major part of this transformation was allowing business analytics queries to access current data, sometimes directly against operational databases.  This meant that analysts could make real-time decisions about what was trending today; in particular, they could fix issues such as incorrect prices or poor product placement almost immediately.

Today’s Solutions

Typically, the business perceived a need for understanding data relationships or correlations that may not be obvious. Third-party vendors made available a variety of plug-and-play solutions that usually required a specific hardware platform and special software for data management and analytics. Some examples of these include the following.

Hadoop
Hadoop solutions from Apache use a network of computer nodes. Each node participates in data storage, and the nodes combine their efforts during query processing. Hadoop provides a specialized file management system that gives one or more central nodes the ability to control the others as a group and to coordinate their efforts. This works best for business problems where analytics is done iteratively; that is, business analysts run one or more base queries, store the results, and then run or re-run additional queries against those results.

NoSQL Databases
Some NoSQL database solutions, notably graph databases, depend upon graph analysis, a method of finding relationships among data elements. Using NoSQL requires an experienced data scientist to support the underlying database as well as business analysts familiar with this analytic technique.

The Appliance
Initially offered as a stand-alone hybrid hardware and software solution, appliances support many big data applications. Hundreds of disk drives are mounted in a single disk array, and each data entity (customer, product, etc.) has its table data split evenly across all disks. When the analyst submits a query, the software splits the query into hundreds of subqueries, one for each disk drive, executes the subqueries in parallel, and combines the results. It is this massively parallel process that allows the appliance to return results so quickly. IBM’s IDAA (IBM Db2 Analytics Accelerator) is the best example of this.
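
A conceptual sketch of that split-execute-combine pattern follows. It is not IDAA code; the partition count, the thread-pool fan-out, and the placeholder per-partition scan are all assumptions made purely for illustration:

```python
# Conceptual scatter-gather: split one aggregate query into per-drive subqueries,
# run them in parallel, then combine the partial results.
from concurrent.futures import ThreadPoolExecutor

PARTITIONS = range(100)  # stand-ins for the hundreds of disk drives in the array


def scan_partition(partition_id):
    # Placeholder for the subquery that scans only the rows stored on one drive.
    return 0.0


with ThreadPoolExecutor(max_workers=16) as pool:
    partial_sums = list(pool.map(scan_partition, PARTITIONS))

total_revenue = sum(partial_sums)  # the combine step that produces the final answer
print(total_revenue)
```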

Most IT organizations tended to choose one of these as their enterprise big data solution. However, in the past few years disk array storage has become larger, faster and cheaper. Combining this with the ability to implement data storage in memory rather than on disk has provided an enormous speed increase. The net result is that IT can now afford to either purchase multiple solutions and install them in-house or outsource the data storage and processing to an external provider, sometimes called database as a service (DBaaS).

Regrettably, accumulating more data hasn’t made prediction and trending analysis more useful. For example, expanding product sales history from five years to ten years isn’t very useful, since over time data changes, products change, databases change, and applications change. (One exception is customer purchase history, since it can be used to predict how customers’ preferences will change over time.) Another issue is stale data. As products reach the end of their useful life and some services are no longer offered, analysts find less and less need to issue queries about them. A final concern is new applications and the new data accompanying them. Since there is little or no history for these new data items, how can one look for correlations and trends?

In short, big data has reached the point where IT has extracted most of the value from its historical data. So what’s next?

Tomorrow’s Analytics

The current state of business analytics has shifted from simple big data applications to suites of machine learning and AI solutions. These solutions tend to be specific to either a small number of applications (such as on-line order entry) or a small set of related data (such as product sales transactions).

One example of these new systems is IBM’s Watson Machine Learning (WML) for z/OS. WML is implemented as a service that interfaces with several varieties of data on z/OS (Db2, VSAM, SMF, etc.), creates and trains machine learning models, scores them and compares the models with live metrics. The operations analytics software then classifies and analyzes the results to detect trends and relationships. (For more on this offering see the article, IBM Improves IT Operations with Artificial Intelligence.)
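
WML itself is a packaged IBM service, but the train-score-compare loop it implements can be sketched generically. The following uses scikit-learn with made-up features purely to illustrate the workflow; it is not WML and is not tied to any z/OS data source:

```python
# Generic train / score / compare-with-live-metrics loop (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)                 # stand-in for historical feature data
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # stand-in label (e.g. "fraudulent?")

X_hist, X_live, y_hist, y_live = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_hist, y_hist)

# Score the trained model and compare it against "live" outcomes.
live_scores = model.predict_proba(X_live)[:, 1]
print("AUC against live data:", round(roc_auc_score(y_live, live_scores), 3))
```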

There are several important requirements for this new analytics environment to yield value to the organization. They are:

  • An up-to-date enterprise data model;
  • An emphasis on data quality, particularly in operational systems;
  • The regular purging or archiving of stale or old data.

The need for a data model is obvious. How can AI systems develop correct and meaningful relationships among data items if they are not well-defined? The same is true for business analysts, who will assist application developers in interfacing operational applications with the data. For example, consider the development of an on-line product ordering application. The business wishes to detect (and perhaps prevent) fraud, so they want to interface their application with an AI system that can analyze current transactions against historical ones. If the data elements are not well-defined this effort will fail.

Data quality encompasses a large number of overlapping topics. Some data element values can be textually unequal and yet refer to the same thing. For example, the street addresses of “123 Jones Road” and “123 Jones Rd. East” may indicate the exact same house in a particular city, but a basic text comparison of the data values results in inequality. Some text fields may contain internal formatting, such as “2019-01-01” and “01/01/2019”; again, the same meaning but unequal.
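
A minimal sketch of the kind of normalization such comparisons need is shown below. The rules are illustrative only and are nowhere near a complete address or date-format standard:

```python
# Normalize a couple of common formatting differences before comparing values.
import re
from datetime import datetime


def normalize_street(addr):
    addr = re.sub(r"\bRD\b\.?", "ROAD", addr.upper().strip())
    return re.sub(r"\s+", " ", addr)


def normalize_date(text):
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # caller decides how to treat unparseable values


print(normalize_street("123 Jones Rd."))                              # 123 JONES ROAD
print(normalize_date("2019-01-01") == normalize_date("01/01/2019"))   # True
```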

Dates have a penchant for being invalid or missing. Consider the date “01/01/1900”. If this value is stored in a column labelled BirthDate, is it truly correct, or was a default value applied? Similar questions arise for values like “11/31/2019”, “24/01/2019”, “01/01/9999” and even “01/01/0001”.
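
A hedged sketch of such date checks follows; the sentinel list and the single expected format are assumptions chosen just to mirror the examples above:

```python
# Flag birth dates that fail to parse or look like default / sentinel values.
from datetime import date, datetime

SENTINELS = {date(1900, 1, 1), date(9999, 1, 1), date(1, 1, 1)}


def check_birth_date(text, fmt="%m/%d/%Y"):
    try:
        d = datetime.strptime(text, fmt).date()
    except ValueError:
        return "invalid"                 # e.g. 11/31/2019 or 24/01/2019
    return "suspicious default" if d in SENTINELS else "ok"


for value in ("01/01/1900", "11/31/2019", "24/01/2019", "01/01/9999", "07/14/1986"):
    print(value, "->", check_birth_date(value))
```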

Data quality issues extend to parent-child relationships, or referential integrity. If the Order table contains an order for customer ABC, then the Customer table should have a row for that customer.
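
This kind of check is easy to automate. The sketch below assumes a MySQL-compatible connection via the pymysql client, and the table and column names (Orders, Customer, customer_code) are hypothetical:

```python
# Find Order rows whose customer is missing from the Customer table.
import pymysql

conn = pymysql.connect(host="db-host", user="qa", password="***", database="sales")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT o.order_id, o.customer_code
        FROM Orders o
        LEFT JOIN Customer c ON c.customer_code = o.customer_code
        WHERE c.customer_code IS NULL
        """
    )
    for order_id, customer_code in cur.fetchall():
        print(f"Orphaned order {order_id}: unknown customer {customer_code}")
conn.close()
```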

Stale Data Purge

Stale data purges are perhaps the most important data-related issue that must be addressed before AI analytics systems can be implemented successfully. As noted previously, IT support for business analytics will grow to encompass multiple solutions across diverse hardware platforms, both inside and outside the company. For all of these platforms to provide consistent results, IT must coordinate the purge of stale data across multiple data stores.

This is more difficult than it might appear. Consider a set of products that are no longer sold by your company. Data on these products (prices, transaction details, geographic data, etc.) may be stored across several databases. This specific data may be of little use in predicting product sales; however, it might be essential in analyzing customer credit or detecting fraud. If you don’t purge any of this data, you are paying for storage that is wasted; if you purge it from some applications but not from others, you run the risk of losing the relationships across data items. For example, if you purge all “old” product information, what happens to your customers who purchased only those products? Should they be included in customer profile analyses?

Where Will Big Data Go?

In the 1980s there was an IT concept called the very large database (VLDB). These were databases that, for their time, were considered so large and unwieldy that they needed special management. As the size of disk storage grew and access times dramatically shrank, such databases were eventually considered normal, and the term VLDB was no longer used.

Such a fate awaits big data. Already we see big data becoming only a single tool in the IT support toolbox. As one example, IBM has taken its IDAA appliance, which was once a stand-alone hardware product, and incorporated it natively within its z14 mainframe server. As AI and machine learning software come of age, they may well depend upon an internal big data storage solution.

Still, just as VLDB went from implementation within only a few companies to almost anywhere, big data solutions will become commodities. In fact, there are now offerings from many vendors of big data solutions that “scale”. In other words, you can try out a relatively small big data application and scale it up in size later if it provides value. The same holds true for many AI solutions.

The Future of AI and Analytics

Businesses will review AI and big data offerings and choose one or more that fit their needs. After performing a proof-of-concept test, they will implement the ones that provide the most value. Then comes the scary part. IT must find a way to coordinate multiple support processes across these multiple hardware and software platforms, including updating and maintaining the enterprise data model, performing data quality maintenance, and coordinating stale data purge. It will require staff with different skill sets and experience, and interfaces to many vendors.

Finally, IT must work towards federating all of these solutions. Even though they span different hardware platforms and come bundled with different software, IT must find a way to give access to all of the data to all of the solutions. This has already happened with the classic data warehouse. Warehouses contain dimension tables that provide the most common aggregation and subsetting definitions. These tables have already migrated into most big data applications, since business analysts will most commonly use these tables for analytics queries. In fact, it is also likely that big data queries will join tables within the big data application to warehouse base tables. The result is that many data warehouses have been moved into big data applications.

Clearly, your IT strategy must take into account this federation in the near future. Part of the original problem was identifying which platforms or solutions are best for analyzing what data. Federation addresses this by identifying a central interface for all of your analytics solutions. To make this work you must do capacity planning across all platforms and include all parties that will use analytical software in your federation solution.

And big data? It will still be with us, but it will take a back seat as the future highlights data federation and artificial intelligence solutions.

# # #

See all articles by Lockwood Lyon

Case Study: Kurtosys – Why Would I Store My Data In More Than One Database?


Feed: MemSQL Blog.
Author: Floyd Smith.

One of MemSQL’s strengths is speeding up analytics, often replacing NoSQL databases to provide faster performance. Kurtosys, a market-leading digital experience platform for the financial services industry, uses MemSQL exclusively, gaining far faster performance and easier management across transactions and analytics.

Kurtosys is a leader in the Digital Experience category, with the first truly SaaS platform for the financial services industry. In pursuing its goals, Kurtosys became an early adopter of MemSQL. Today, MemSQL is helping to power Kurtosys’ growth.

Stephen Perry, head of data at Kurtosys, summed up the first round of efforts in a blog post several years ago, titled Why Would I Store My Data In More Than One Database? (Among his accomplishments, Steve is one of the first MemSQL-certified developers.)

In the following blog post, we describe how usage of MemSQL has progressed at Kurtosys. In the first round, Kurtosys had difficulties with their original platform, using Couchbase. They moved to MemSQL, achieving numerous benefits.

Further customer requests, and the emergence of new features in MemSQL, opened the door for Kurtosys to create a new platform, which is used by Kurtosys customers to revolutionize the way they deliver outstanding digital and document experiences to their sales teams and to their external communities of clients and prospects. In this new platform, MemSQL is the database of record.

At Kurtosys, Infrastructure Powers Growth

Kurtosys has taken on a challenging task: hosting complex financial data, documents, websites, and content for the financial services industry. Kurtosys customers use the Kurtosys platform for their own customer data, as well as for their sales and marketing efforts.

The customer list for Kurtosys features many top tier firms, including Bank of America, the Bank of Montreal, Generali Investments, and T. Rowe Price. Kurtosys’ customers require high performance and high levels of security.

Customer focus on security is greater in financial services than in most other business segments. A single breach – even a potential breach that is reported, but never actually exploited – can cause severe financial and reputational damage to a company. So customers hold technology suppliers such as Kurtosys to very high standards.

Alongside security, performance is another critical element. Financial services companies claim performance advantages to gain new customers, so suppliers have to deliver reliably and at top speed. And, since financial services companies also differentiate themselves on customer service, they require suppliers to provide excellent customer service in turn.

(Like Kurtosys, MemSQL is well-versed in these challenges. Financial services is perhaps our leading market segment, with half of the top 10 US financial services firms being MemSQL customers.)

With all of these strict requirements, for financial services companies to trust an external provider to host their content – including such crucial content as customer financial data – is a major step. Yet, Kurtosys has met the challenge and is growing quickly.

“Our unique selling proposition is based around the creative use of new and unique technology,” says Steve. “We’ve progressed so far that our original internal platform with MemSQL, which we launched four years ago, is now a legacy product. Our current platform employs a very modern approach to storing data. We are using MemSQL as the primary database for the Kurtosys platform.”

Kurtosys Chooses Infrastructure for Growth

Kurtosys is adept at innovating its infrastructure to power services for demanding customers. For instance, several years ago, Kurtosys used SQL Server to execute transactions and Couchbase as a high-performance, scalable, read-only cache for analytics.

Initially, the combination made sense. Customers of Kurtosys wanted to see the company executing transactions on a database that’s among a handful of well-established transactional databases. SQL Server fit the bill.

However, like other traditional relational databases, SQL Server is, at its core, limited by its dependence on a single core update process. This dependency prevents SQL Server, and other traditional relational databases, from being able to scale out across multiple, affordable servers.

This means that the single machine running SQL Server is usually fully occupied with transaction processing and would struggle to meet Kurtosys’ requirements, such as the need for ad-hoc queries against both structured and semi-structured data. That left Kurtosys needing to copy data to another system, initially Couchbase, and run analytics off that – the usual logic for purchasing a data warehouse or an operational analytics database.

Couchbase seemed to be a logical choice. It’s considered a leading NoSQL database, and is often compared to other well-known NoSQL offerings such as Apache Cassandra, Apache HBase, CouchDB, MongoDB, and Redis. Couchbase tells its target audience that it offers developers the opportunity to “build brilliant customer experiences.”

NoSQL databases have the ability to scale out that traditional relational databases lack. However, NoSQL databases face fundamental limitations in delivering on promises such as those made by CouchBase. NoSQL databases favor unstructured or less-structured data. As the name implies, they don’t support SQL.

Users of these databases don’t benefit from decades of research and experience in performing complex operations on structured and, increasingly, semi-structured data using SQL. With no SQL support, Couchbase can be difficult to work with, and requires people to learn new skills.

Running against unstructured data and semi-structured JSON data, and without the benefit of SQL, Kurtosys found it challenging to come up with an efficient query pattern that worked across different data sets.

Kurtosys Moves to MemSQL to Power Fast Analytics

As a big data database, Couchbase is an excellent tool for data scientists running analytics projects. However, for day in and day out analytics use, it was difficult to write queries, and query performance was subpar. Couchbase was not as well suited for the workloads and high degree of concurrency – that is, large numbers of simultaneous users – required for internal user and customer analytics support, including ad hoc SQL queries, business intelligence (BI) tools, and app support.

At the same time, Kurtosys needed to stay on SQL Server for transactions. Kurtosys had invested a lot in SQL Server-specific stored procedures. Its customers also liked the fact that Kurtosys uses one of the top few best-known relational databases for transactions.

So, after much research, Kurtosys selected a fully distributed database which, at the time, ran in-memory: MemSQL. Because MemSQL is also a true relational database, and supports the MySQL wire protocol, Kurtosys was able to use the change data capture (CDC) process built into SQL Server to keep MemSQL’s copy of the data up to date. MemSQL received updates a few seconds after each transaction completed in SQL Server. Queries then ran against MemSQL, allowing both updates and queries to run fast against the respective databases.

In the original platform, updates ran against SQL Server; CDC then moved the changes to MemSQL, which served queries.

SQL Server was now fully dedicated to transaction support, with the CDC process imposing little overhead on processing. And, because of MemSQL’s speed, the database was able to easily keep up with the large and growing transaction volume going through the Kurtosys platform.
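
As a minimal sketch of the wire-protocol compatibility described above, any MySQL driver can query the MemSQL copy of the data. The host, credentials, and table below are hypothetical, and pymysql is just one possible client:

```python
# Query the MemSQL copy of CDC-replicated data over the MySQL wire protocol.
import pymysql

memsql = pymysql.connect(host="memsql-aggregator", user="analytics",
                         password="***", database="reporting")
with memsql.cursor() as cur:
    # An ad-hoc analytics query that would otherwise have loaded SQL Server.
    cur.execute("SELECT account_id, COUNT(*) FROM documents GROUP BY account_id")
    for account_id, doc_count in cur.fetchall():
        print(account_id, doc_count)
memsql.close()
```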

Kurtosys summed up its approach at the time in a slide deck that’s available within a Kurtosys blog post. The key summary slide is below.

MemSQL-Based Platform Powers New Applications

Kurtosys has now created a new internal platform. One of the key capabilities in the new platform is support for JSON data. In a recent MemSQL release, MemSQL 6.7, JSON data support is a core feature. In fact, comparing JSON data to fully structured data, “performance is about the same, which is a testament to MemSQL,” says Steve. With this capability, Kurtosys can keep many of the same data structures that it had previously used in Couchbase, but with outstanding performance.
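
A hedged sketch of what that looks like in practice follows. The table, the sample document, and the use of MemSQL’s JSON column type and JSON_EXTRACT_STRING helper are assumptions based on the MemSQL documentation, not code from Kurtosys:

```python
# Store and query semi-structured JSON alongside relational columns in MemSQL.
import pymysql

conn = pymysql.connect(host="memsql-aggregator", user="dev", password="***",
                       database="platform", autocommit=True)
with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT, payload JSON)")
    cur.execute("INSERT INTO events VALUES (%s, %s)",
                (1, '{"type": "page_view", "fund": "ABC"}'))
    cur.execute("SELECT JSON_EXTRACT_STRING(payload, 'fund') FROM events WHERE id = 1")
    print(cur.fetchone())  # ('ABC',)
conn.close()
```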

Also, when Kurtosys first adopted MemSQL, several years ago, MemSQL was largely used as an in-memory database. This gave truly breakthrough performance, but with accompanying higher costs. Today, MemSQL flexibly supports both rowstore tables in memory and disk-based columnstore. “Performance,” says Steve, “is almost too good to believe.”
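
The mix of in-memory rowstore and disk-based columnstore tables can be sketched as below. The schemas are invented for illustration, and the CLUSTERED COLUMNSTORE key syntax follows the MemSQL documentation of that era rather than anything specific to Kurtosys:

```python
# Define a hot in-memory rowstore table and a disk-based columnstore table.
import pymysql

conn = pymysql.connect(host="memsql-aggregator", user="dev", password="***",
                       database="platform", autocommit=True)
with conn.cursor() as cur:
    # Rowstore (the default table type): frequently updated operational rows.
    cur.execute("""CREATE TABLE IF NOT EXISTS sessions (
                       session_id BIGINT PRIMARY KEY,
                       user_id BIGINT,
                       started_at DATETIME)""")
    # Columnstore: large analytical history kept on disk and scanned in bulk.
    cur.execute("""CREATE TABLE IF NOT EXISTS page_views (
                       viewed_at DATETIME,
                       user_id BIGINT,
                       url TEXT,
                       KEY (viewed_at) USING CLUSTERED COLUMNSTORE)""")
conn.close()
```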

The new platform runs MemSQL for both transactions and queries. In the new platform, there’s no longer a need for CDC. Kurtosys runs MemSQL as a transactional database, handling both transactions and analytics.

In the new platform, updates and queries all run on MemSQL.

The new internal platform powers Kurtosys applications with thousands of concurrent users, accessing hundreds of gigabytes of data, and with a database growing by several gigabytes of data a day.

Kurtosys is looking forward to using the new features of MemSQL to power the growth of their platform. As Steve Perry says, in a separate blog post, “What they do, they do right… we use MemSQL to improve the performance of query response.”

Stepping Ahead with MemSQL

MemSQL is a fundamental component of the key value proposition that Kurtosys offers its customers – and cutting-edge platforms, like the one being developed at Kurtosys today, will continue to push MemSQL forward.

To see the benefits of MemSQL for yourself, you can try MemSQL today for free. Or, contact us to speak with a technical professional who can describe how MemSQL can help you achieve your goals.

Azure Marketplace new offers – Volume 37


Feed: Microsoft Azure Blog.
Author: Christine Alford.

Accela Civic Platform and Civic Applications

Accela Civic Platform and Civic Applications: Accela’s fast-to-implement civic applications and robust and extensible solutions platform help agencies respond to the rapid modernization of technology with SaaS solutions that offer high degrees of security, flexibility, and usability.

Actifile Guardrail-Secure Data on 0 Trust Devices

Actifile Guardrail-Secure Data on 0 Trust Devices: Actifile’s Guardrail unique low-footprint technology enables secure usage of corporate data taken from any application or data source.

Adrenalin HCM

Adrenalin HCM: Human resource function is the quintessential force that enables an organization’s strongest asset to perform better and benefit themselves and the company. Reimagine your HR function with Adrenalin HCM.

Advanced Threat Protection for OneDrive

Advanced Threat Protection for OneDrive: BitDam helps enterprises take full advantage of all OneDrive has to offer while delivering advanced threat protection against content-borne attacks.

AGR - Advanced Demand Planning

AGR – Advanced Demand Planning: This modular AGR solution allows you to make more consistent planning decisions and more accurate buying decisions and helps ensure you have the right product in the right place at the right time.

agroNET - Digital Farming Management Platform

agroNET – Digital Farming Management Platform: agroNET is a turnkey digital farming solution that enables smart agriculture service providers and system integrators to rapidly deploy the service tailored to the needs of farmers.

AIMSCO Azure MES - QM Platform for SME Manufacturers

AIMSCO Azure MES/QM Platform for SME Manufacturers: With embedded navigation dashboards, displays, alerts, APIs, and BI interfaces, AIMSCO Azure MES/QM Platform users from the shop floor to the boardroom have real-time access to critical decision-making tools.

AIRA Robotics as a Service

AIRA Robotics as a Service: Transform the installation of new equipment from CAPEX to OPEX as a part of a digital transformation using the AIRA digitalization system for long-term service relationships with suppliers.

Apex Portal

Apex Portal: Use Apex Portal for supplier registration, self-service inquiry of invoice and payment status, dynamic discounting and early payments, and automated statement audits.

AppStudio

AppStudio: AppStudio is a suite of offerings for managing apps using a standardized methodology to ensure you are up to date and ready for the next challenge.

ArcBlock ABT Blockchain Node

ArcBlock ABT Blockchain Node: ABT Blockchain Node is fully decentralized and uses ArcBlock’s blockchain development platform to easily build, run, and use DApps and blockchain-ready services.

ArcGIS Enterprise

ArcGIS Enterprise 10.7: Manage, map, analyze, and share geographic information systems (GIS) data with ArcGIS Enterprise, the complete geospatial system that powers your data-driven decisions.

Area 1 Horizon Anti-Phishing Service for Office365

Area 1 Horizon Anti-Phishing Service for Office 365: Area 1 Security closes the phishing gap with a preemptive, comprehensive, and accountable anti-phishing service that seamlessly integrates with and fortifies Microsoft Office 365 security defenses.

Arquivar-GED

Arquivar-GED: ArqGED is document management software that allows users to dynamically solve problems with location and traceability of information in any format (paper, digital, microfilm, etc.).

Aruba Virtual Gateway SD-WAN

Aruba Virtual Gateway (SD-WAN): Aruba’s software-defined WAN (SD-WAN) technology simplifies wide area network operations and improves application QoS to lower your total cost of ownership.

Arundo Analytics

Arundo Analytics: Arundo delivers enterprise-scale machine learning and advanced analytics applications to improve operations in heavy asset industries.

Assurity Suite

Assurity Suite: The Assurity Suite platform provides assurance and control over your organization’s documents, communications, investigations, compliance, information, and processes.

AtilektNET

Atilekt.NET: Website-building platform Atilekt.NET is a friendly, flexible, and fast-growing content management system based on ASP.NET.

Axians myOperations Patch Management

Axians myOperations Patch Management: Axians myOperations Server Patch Management integrates a complete management solution to simplify the rollout, monitoring, and reporting of Windows updates.

Axioma Risk

Axioma Risk: Axioma Risk is an enterprise-wide risk-management system that enables clients to obtain timely, consistent, and comparable views of risk across an entire organization and all asset classes.

Azure Analytics System Solution

Azure Analytics System Solution: BrainPad’s Azure Analytics System Solution is designed for enterprises using clouds for the first time as well as companies considering sophisticated usage. This application is available only in Japanese.

Beam Communications

Beam Communications: Communications are a fundamental element in institutional development, and Beam Communications boosts internal and external communications. This application is available only in Spanish.

Betty Blocks Platform

Betty Blocks Platform: From mobile apps to customer portals to back-office management and everything in between, the Betty Blocks platform supports every app size and complexity.

BIClinical

BI-Clinical: BI-Clinical is CitiusTech’s ONC- and NCQA-certified BI and analytics platform designed to address the healthcare organization’s most critical quality reporting and decision support needs.

Bizagi Digital Business Platform

Bizagi Digital Business Platform: The Bizagi platform helps enterprises embrace change by improving operational efficiencies, time to market, and compliance.

Bluefish Editor on Windows Server 2019

Bluefish Editor on Windows Server 2019: The Bluefish software editor supports a plethora of programming languages including HTML, XHTML, CSS, XML, PHP, C, C++, JavaScript, Java, Google Go, Vala, Ada, D, SQL, Perl, ColdFusion, JSP, Python, Ruby, and Shell.

BotCore - Enterprise Chatbot Builder

BotCore – Enterprise Chatbot Builder: BotCore is an accelerator that enables organizations to build customized conversational bots powered by artificial intelligence. It is fully deployable to Microsoft Azure and leverages many of the features available in it.

Brackets

Brackets: With focused visual tools and preprocessor support, Brackets is a modern text editor that makes it easy to design in the browser. It’s crafted for web designers and front-end developers.

Brackets On Windows Server 2019

Brackets on Windows Server 2019: With focused visual tools and preprocessor support, Brackets is a modern text editor that makes it easy to design in the browser. It’s crafted for web designers and front-end developers.

bugong

bugong: The bugong platform combines leading algorithm technology with intelligent manufacturing management. This application is available only in Chinese.

Busit Application Enablement Platform

Busit Application Enablement Platform: Busit Application Enablement Platform (AEP) enables fast and efficient handling of all your devices and services, regardless of the brand, manufacturer, or communication protocol.

ByCAV

ByCAV: ByCAV provides biometric identity validation through non-traditional channels for companies in diverse industries that require identity verification. This application is available in Spanish only in Colombia.

Camel Straw

Camel Straw: Camel Straw is a cloud-based load testing platform that helps teams load test and analyze and improve the way their applications scale.

Celo

Celo: Celo connects healthcare professionals. From big hospitals to small clinics, Celo helps healthcare professionals communicate better.

Cirkled In - College Recruitment Platform

Cirkled In – College Recruitment Platform: Cirkled In is a revolutionary, award-winning recruitment platform that helps colleges match with best-fit high school students based on students’ holistic portfolio.

Cirkled In - Student Profile & Portfolio Platform

Cirkled In – Student Profile & Portfolio Platform: Cirkled In is a secure, award-winning electronic portfolio platform for students designed to compile students’ achievements in seven categories from academics to sports to volunteering and more.

Cleafy Fraud Manager for Azure

Cleafy Fraud Manager for Azure: Cleafy combines deterministic malware detection with passive behavioral and transactional risk analysis to protect online services against targeted attacks from compromised endpoints without affecting your users and business.

Cloud Desktop

Cloud Desktop: Cloud Desktops on Microsoft Azure offers continuity and integration with the tools and applications that you already use.

Cloud iQ - Cloud Management Portal

Cloud iQ – Cloud Management Portal: Crayon Cloud-iQ is a self-service platform that enables you to manage cloud products (Azure, Office 365, etc.), services, and economics across multiple vendors through a single pane portal view.

Cloudneeti - Continuous Assurance SaaS

Cloudneeti – Continuous Assurance SaaS: Cloudneeti SaaS enables instant visibility into security, compliance, and data privacy posture and enforces industry standards through continuous and integrated assurance aligned with the cloud-native operating model.

Collaboro - Digital Asset Management

Collaboro – Digital Asset Management: Collaboro partners with brands, institutions, government, and advertising agencies to solve their specific digital asset management needs in a fragmented marketing and media space.

Connected Drone

Connected Drone: Targeting power and utilities, eSmart Systems Connected Drone software utilizes deep learning to dramatically reduce utility maintenance costs and failure rates and extend asset life.

CyberVadis

CyberVadis: By pooling and sharing analyst-validated cybersecurity audits, CyberVadis allows you to scale up your third-party risk assessment program while controlling your costs.

Data Quality Management Platform

Data Quality Management Platform: BaseCap Analytics’ Data Quality Management Platform helps you make better business decisions by measurably increasing the quality of your greatest asset: data.

DatabeatOMNI

DatabeatOMNI: DatabeatOMNI provides you with everything you need to display great content, on as many screens as you want to – without complex interfaces, specialist training, or additional procurement costs.

dataDiver

dataDiver: dataDiver is an extended analytics tool for gaining insights into research design that is neither traditional BI nor BA. This application is available only in Japanese.

dataFerry

dataFerry: dataFerry is a data preparation tool that allows you to easily process data from various sources into the desired form. This application is available only in Japanese.

Dataprius Cloud

Dataprius Cloud: Dataprius offers a different way to work with files in the cloud, allowing you to work with company files without synchronizing, without conflicts, and with multiple users connected at the same time.

Denodo Platform 7.0 14 day Free Trial BYOL

Denodo Platform 7.0 14-day Free Trial (BYOL): Denodo integrates all of your Azure data sources and your SaaS applications to deliver a standards-based data gateway, making it quick and easy for users of all skill levels to access and use your cloud-hosted data.

Descartes MacroPoint

Descartes MacroPoint: Descartes MacroPoint consolidates logistics tracking data from carriers into a single integrated platform to meet two growing challenges: real-time freight visibility and automated capacity matching.

Digital asset management DAM Managed Application

Digital Asset Management (DAM) Managed Application: Digital Asset Management delivers a secured and centralized repository to manage videos. It offers capabilities for advanced embed, review, approval, publishing, and distribution of videos.

Digital Fingerprints

Digital Fingerprints: Digital Fingerprints is a continuous authentication system based on behavioral biometrics.

DM REVOLVE - Dynamics Data Migration

DM REVOLVE – Dynamics Data Migration: DM REVOLVE is a dedicated Azure-based Dynamics end-to-end data migration solution that incorporates “Dyn-O-Matic,” our specialized Dynamics automated load adaptor.

Docker Community Edition Ubuntu Bionic Beaver

Docker Community Edition Ubuntu Bionic Beaver: Deploy Docker Community Edition with Ubuntu on Azure with this free, community-supported, DIY version of Docker on Ubuntu.

Docker Community Edition Ubuntu Xenial

Docker Community Edition Ubuntu Xenial: Deploy Docker Community Edition with Ubuntu on Azure with this community-supported, DIY version of Docker on Ubuntu.

Dom Rock AI for business platform

Dom Rock AI for Business Platform: The Dom Rock AI for business platform empowers people to make better and faster decisions, informed by data. This application is available only in Portuguese.

Done.pro

Done.pro: Done.pro enables “Uber for X” cloud platforms, customized and tuned for your business, to provide customers with exceptional service.

eComFax - secure advanced messaging platform

eComFax: Secure Advanced Messaging Platform: Comunycarse Network Consultants eComFax is a secure, advanced messaging platform designed for compliance and mobility.

EDGE

EDGE: The Edge system allows seamless operations across the UK – in both the established Scottish market and the new English market.

eJustice

eJustice: The eJustice solution provides information and communication technology enablement for courts.

ekonet - air quality monitoring

ekoNET – Air Quality Monitoring: ekoNET combines portable devices and cloud-based functionality to enable granular air quality monitoring indoors and outdoors.

Element AssetHub

Element AssetHub: AssetHub is a data hub connecting time series, IT, and OT to manage operational asset models.

Equinix Cloud Exchange Fabric

Equinix Cloud Exchange Fabric: This software-defined interconnection solution allows you to directly, securely, and dynamically connect distributed infrastructure and digital ecosystems to your cloud service providers.

ERP Beam Education

ERP Beam Education: ERP Beam Education efficiently integrates all the processes that are part of managing an educational center. This application is available only in Spanish.

Essatto Data Analytics Platform

Essatto Data Analytics Platform: Essatto enables more informed decision making by providing timely insights into your financial and business operations in a flexible, cost-effective application.

Event Monitor

Event Monitor: Event Monitor is a user-friendly solution meant for security teams that are responsible for safety.

Firewall as a Service

Firewall as a Service: Firewall as a Service delivers a next-generation managed internet gateway from Microsoft Azure including 24/7 support, self-service, and unlimited changes by our security engineers.

GDPR   for Data Protection & Security

GDPR++ for Data Protection & Security: GDPR++ is an Azure-based tool that helps companies keep data protection and cyber security under control.

GEODI

GEODI: GEODI helps you focus on your business by letting you share information, documents, notes, and notifications with contacts and stakeholders via mobile app or browser.

GeoServer

GeoServer: Make your spatial information accessible to all with this free, community-supported open source server based on Java for sharing geospatial data.

Geoserver On Windows Server 2019

GeoServer on Windows Server 2019: Make your spatial information accessible to all with this free, community-supported open source server based on Java for sharing geospatial data.

Ghost Helm Chart

Ghost Helm Chart: Ghost is a modern blog platform that makes publishing beautiful content to all platforms easy and fun. Built on Node.js, it comes with a simple markdown editor with preview, theming, and SEO built in.

Grafana Multi-Tier with Azure Managed DB

Grafana Multi-Tier with Azure Managed DB: Grafana is an open source analytics and monitoring dashboard for over 40 data sources, including Graphite, Elasticsearch, Prometheus, MariaDB/MySQL, PostgreSQL, InfluxDB, OpenTSDB, and more.

HashiCorp Consul Helm Chart

HashiCorp Consul Helm Chart: HashiCorp Consul is a tool for discovering and configuring services in your infrastructure.

HPCBOX HPC Cluster for STAR-CCM

HPCBOX: HPC Cluster for STAR-CCM+: HPCBOX combines cloud infrastructure, applications, and managed services to bring supercomputer technology to your personal computer.

H-Scale

H-Scale: H-Scale is a modular, configurable, and scalable data integration platform that helps organizations build confidence in their data and accelerate their data strategies.

Integrated Cloud Suite

Integrated Cloud Suite: CitiusTech’s Integrated Cloud Suite is a one-stop solution that enables healthcare organizations to reduce complexity and drive a multi-cloud strategy optimally and cost-effectively.

JasperReports Helm Chart

JasperReports Helm Chart: JasperReports Server is a standalone and embeddable reporting server. It is a central information hub, with reporting and analytics that can be embedded into web and mobile applications.

Jenkins Helm Chart

Jenkins Helm Chart: Jenkins is a leading open source continuous integration and continuous delivery (CI/CD) server that enables the automation of building, testing, and shipping software projects.

Jenkins On Ubuntu Bionic Beaver

Jenkins On Ubuntu Bionic Beaver: Jenkins is a simple, straightforward continuous integration tool that effortlessly distributes work across multiple devices and assists in building, testing, and deployment.

Jenkins-Docker CE on Ubuntu Bionic Beaver

Jenkins-Docker CE on Ubuntu Bionic Beaver: This solution takes away the hassles of setting up the installation process of Jenkins and Docker. The ready-made image integrates Jenkins-Docker to make continuous integration jobs smooth, effective, and glitch-free.

Join2ship

Join2ship: Join2ship is a collaborative supply chain platform designed to digitalize your receipts and deliveries.

Kafka Helm Chart

Kafka Helm Chart: Tested to work on the EKS platform, Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Kaleido Enterprise Blockchain SaaS

Kaleido Enterprise Blockchain SaaS: Kaleido simplifies the process of creating and operating permissioned blockchains with a seamless experience across cloud properties and geographies for all network participants.

Kubeapps Helm Chart

Kubeapps Helm Chart: Kubeapps is a web-based application deployment and management tool for Kubernetes clusters.

LOOGUE FAQ

LOOGUE FAQ: LOOGUE FAQ is an AI virtual agent that creates chatbots that support queries by creating and uploading two columns of questions and answers in Excel. This application is available only in Japanese.

Magento Helm Chart

Magento Helm Chart: Magento is a powerful open source e-commerce platform. Its rich feature set includes loyalty programs, product categorization, shopper filtering, promotion rules, and much more.

MariaDB Helm Chart

MariaDB Helm Chart: MariaDB is an open source, community-developed SQL database server that is widely used around the world due to its enterprise features, flexibility, and collaboration with leading tech firms.

Metrics Server Helm Chart

Metrics Server Helm Chart: Metrics Server aggregates resource usage data, such as container CPU and memory usage, in a Kubernetes cluster and makes it available via the Metrics API.

MNSpro Cloud Basic

MNSpro Cloud Basic: MNSpro Cloud combines the management of your school network with a learning management system, whether you use Windows, iOS, or Android devices.

MongoDB Helm Chart

MongoDB Helm Chart: MongoDB is a scalable, high-performance, open source NoSQL database written in C++.

MySQL 5.6 Secured Ubuntu Container with Antivirus

MySQL 5.6 Secured Ubuntu Container with Antivirus: MySQL is a popular open source relational database management system and one of the most widely used RDBMSs for developing web-based software applications.

MySQL 8.0 Secured Ubuntu Container with Antivirus

MySQL 8.0 Secured Ubuntu Container with Antivirus: MySQL is a popular open source relational database management system and one of the most widely used RDBMSs for developing web-based software applications.

MySQL Helm Chart

MySQL Helm Chart: MySQL is a fast, reliable, scalable, and easy-to-use open source relational database system. MySQL Server is designed to handle mission-critical, heavy-load production applications.

NATS Helm Chart

NATS Helm Chart: NATS is an open source, lightweight, and high-performance messaging system. It is ideal for distributed systems and supports modern cloud architectures and pub-sub, request-reply, and queuing models.

NetApp Cloud Volumes ONTAP

NetApp Cloud Volumes ONTAP: NetApp Cloud Volumes ONTAP, a leading enterprise-grade storage management solution, delivers secure, proven storage management services and supports up to a capacity of 368 TB.

Nodejs Helm Chart

Node.js Helm Chart: Node.js is a runtime environment built on the V8 JavaScript engine. Its event-driven, non-blocking I/O model enables the development of fast, scalable, and data-intensive server applications.

Node 6 Secured Jessie Container with Antivirus

Node 6 Secured Jessie Container with Antivirus: Node.js is an open source, cross-platform JavaScript runtime environment for developing a diverse variety of tools and applications.

Odoo Helm Chart

Odoo Helm Chart: Odoo is an open source ERP and CRM platform that can connect a wide variety of business operations such as sales, supply chain, finance, and project management.

On-Demand Mobility Services Platform

On-Demand Mobility Services Platform: Deploy this intelligent, on-demand transportation operating system for automotive OEMs that need to run professional mobility services to embrace the new automotive era and manage the decline of vehicle ownership.

OpenCart Helm Chart

OpenCart Helm Chart: OpenCart is a free, open source e-commerce platform for online merchants. OpenCart provides a professional and reliable foundation from which to build a successful online store.

OrangeHRM Helm Chart

OrangeHRM Helm Chart: OrangeHRM is a feature-rich, intuitive HR management system that offers a wealth of modules to suit the needs of any business. This widely used system provides an essential HR management platform.

Osclass Helm Chart

Osclass Helm Chart: Osclass allows you to easily create a classifieds site without any technical knowledge. It provides support for presenting general ads or specialized ads and is customizable, extensible, and multilingual.

ownCloud Helm Chart

ownCloud Helm Chart: ownCloud is a file storage and sharing server that is hosted in your own cloud account. Access, update, and sync your photos, files, calendars, and contacts on any device, on a platform that you own.

Paladion MDR powered by AI Platform - AIsaac

Paladion MDR powered by AI Platform – AI.saac: Paladion’s managed detection and response, powered by our next-generation AI platform, is a managed security service that provides threat intelligence, threat hunting, security monitoring, incident analysis, and incident response.

Parse Server Helm Chart

Parse Server Helm Chart: Parse is a platform that enables users to add a scalable and powerful back end to launch a full-featured app for iOS, Android, JavaScript, Windows, Unity, and more.

Phabricator Helm Chart

Phabricator Helm Chart: Phabricator is a collection of open source web applications that help software companies build better software.

PHP 5.6 Secured Jessie-cli Container with Antivirus

PHP 5.6 Secured Jessie-cli Container with Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

PHP 5.6 Secured Stretch Container with Antivirus

PHP 5.6 Secured Stretch Container with Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

PHP 7.0 Secured Jessie Container with Antivirus

PHP 7.0 Secured Jessie Container with Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

PHP 7.0 Secured Jessie-cli Container - Antivirus

PHP 7.0 Secured Jessie-cli Container – Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

PHP 7.0 Secured Stretch Container with Antivirus

PHP 7.0 Secured Stretch Container with Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

PHP 7.1 Secured Jessie Container with Antivirus

PHP 7.1 Secured Jessie Container with Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

PHP 7.1 Secured Jessie-cli Container with Antivirus

PHP 7.1 Secured Jessie-cli Container with Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

PHP 7.1 Secured Stretch Container with Antivirus

PHP 7.1 Secured Stretch Container with Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

PHP 7.2 Secured Stretch Container with Antivirus

PHP 7.2 Secured Stretch Container with Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

PHP 7.3 Rc Stretch Container with Antivirus

PHP 7.3 Rc Stretch Container with Antivirus: PHP is a server-side scripting language designed for web development. It is mainly used for server-side scripting and can collect form data, generate dynamic page content, and send and receive cookies.

phpBB Helm Chart

phpBB Helm Chart: phpBB is a popular bulletin board that features robust messaging capabilities such as flat message structure, subforums, topic split/merge/lock, user groups, full-text search, and attachments.

PostgreSQL Helm Chart

PostgreSQL Helm Chart: PostgreSQL is an open source object-relational database known for reliability and data integrity. ACID-compliant, it supports foreign keys, joins, views, triggers, and stored procedures.

Project Ares

Project Ares: Project Ares by Circadence is an award-winning, gamified learning and assessment platform that helps cyber professionals of all levels build new skills and stay up to speed on the latest tactics.

Python Secured Jessie-slim Container - Antivirus

Python Secured Jessie-slim Container – Antivirus: This image is for customers looking to deploy a self-managed Community Edition on a hardened kernel rather than a vanilla install.

Quvo

Quvo: Quvo is a cloud-first, mobile-first working platform designed especially for public sector and enterprise mobile workforces.

RabbitMQ Helm Chart

RabbitMQ Helm Chart: RabbitMQ is a messaging broker that gives your applications a common platform to send and receive messages, and your messages a safe place to live until received.

Recordia

Recordia: Smart Recording & Archiving Interactions: Recordia facilitates gathering all valuable customer interactions under one single repository in the cloud. Know how your sales, marketing, and support staff is doing.

Redis Helm Chart

Redis Helm Chart: Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, and sorted sets.

Redmine Helm Chart

Redmine Helm Chart: Redmine is a popular open source project management and issue tracking platform that covers multiple projects and subprojects, each with its own set of users and tools, from the same place.

Secured MySQL 5.7 on Ubuntu 16.04 LTS

Secured MySQL 5.7 on Ubuntu 16.04 LTS: MySQL is a popular open source relational database management system and one of the most widely used RDBMSs for developing web-based software applications.

Secured MySQL 5.7 on Ubuntu 18.04 LTS

Secured MySQL 5.7 on Ubuntu 18.04 LTS: MySQL is a popular open source relational database management system and one of the most widely used RDBMSs for developing web-based software applications.

Smart Planner

Smart Planner: Smart Planner is a web platform for the optimization of productive processes, continuous improvement, and integral management of the supply chain. This application is available only in Spanish.

SmartVM API - Improve your vendor master file

SmartVM API – Improve your vendor master file: The SmartVM API vendor master cleansing, enriching, and continuous monitoring technology automates vendor master management to help you mitigate risks, eliminate costly information gaps, and improve your supplier records.

SuiteCRM Helm Chart

SuiteCRM Helm Chart: SuiteCRM is an open source, enterprise-grade customer relationship management (CRM) application that is a fork of the popular SugarCRM application.

Talend Cloud - Remote Engine for Azure

Talend Cloud: Remote Engine for Azure: Talend Cloud is a unified, comprehensive, and highly scalable integration Platform as-a-Service (iPaaS) that makes it easy to collect, govern, transform, and share data.

TensorFlow ResNet Helm Chart

TensorFlow ResNet Helm Chart: TensorFlow ResNet is a client utility for use with TensorFlow Serving and ResNet models.

Terraform On Windows Server 2019

Terraform on Windows Server 2019: Terraform is used to create, change, and improve your infrastructure via declarative code.

TestLink Helm Chart

TestLink Helm Chart: TestLink is test management software that facilitates software quality assurance. It supports test cases, test suites, test plans, test projects and user management, and stats reporting.

Tomcat Helm Chart

Tomcat Helm Chart: Tomcat is a widely adopted open source Java application and web server. Created by the Apache Software Foundation, it is lightweight and agile with a large ecosystem of add-ons.

Transfer Center

Transfer Center: The comprehensive patient analytics and real-time reporting in Transfer Center help ensure improved care coordination, streamlined patient flow, and full regulatory compliance.

Unity Cloud

Unity Cloud: Unity is underpinned by Docker, so you can write custom full-code extensions in any language and enjoy fault tolerance, high availability, and scalability.

User Management Pack 365

User Management Pack 365: User Management Pack 365 is a powerful software application that simplifies user lifecycle and identity management across Skype for Business deployments.

Visual Studio Emulator on Windows Server 2016

Visual Studio Emulator on Windows Server 2016: Visual Studio Emulator plays an important role in the edit-compile-debug cycle of your Android testing.

Webfopag - Online Payroll

Webfopag – Online Payroll: Fully process payroll while meeting your business compliance rules. This application is available only in Portuguese.

WordPress Helm Chart

WordPress Helm Chart: WordPress is one of the world’s most popular blogging and content management platforms. It is powerful yet simple, and everyone from students to global corporations uses it to build beautiful, functional websites.

Xampp

XAMPP: XAMPP is specifically designed to make it easier for developers to install the distribution to get into the Apache universe.

Xampp Windows Server 2019

XAMPP Windows Server 2019: XAMPP is specifically designed to make it easier for developers to install the distribution to get into the Apache universe.

XS VM Lift & Shift

XS VM Lift & Shift with Provisioning & Metering: Modernize migration, provisioning, and automatic metering with the Beacon42 metering tool. This application is available only in Spanish.

ZooKeeper Helm Chart

ZooKeeper Helm Chart: ZooKeeper provides a reliable, centralized register of configuration data and services for distributed applications.

Webinar: The Benchmark Breakthrough Using MemSQL


Feed: MemSQL Blog.
Author: Floyd Smith.

MemSQL has reached a benchmarking breakthrough: the ability to run three very different database benchmarks, fast, on a single, scalable database. The leading transactions benchmark, TPC-C, and analytics benchmarks, TPC-H and TPC-DS, don’t usually run on the same scale-out database at all. But MemSQL runs transactional and analytical workloads simultaneously, on the same data, and with excellent performance.

As we describe in this webinar write-up, our benchmarking breakthrough demonstrates this unusual, and valuable, set of capabilities. You can also read a detailed description of the benchmarks and view the recorded webinar.

MemSQL stands out because it is a relational database, with native SQL support – like legacy relational databases – but also fully distributed, horizontally scalable simply by adding additional servers, like NoSQL databases. This kind of capability – called NewSQL, translytical, HTAP, or HOAP – is becoming more and more highly valued for its power and flexibility. It’s especially useful for a new category of workloads called operational analytics, where live, up-to-date data is streamed into a data store to drive real-time decision-making.

MemSQL benchmarked successfully against TPC-C, TPC-H, and TPC-DS

The webinar was presented by two experienced MemSQL pros: Eric Hanson, principal product manager, and Nick Kline, director of engineering. Both were directly involved in the benchmarking effort.

MemSQL and Transaction Performance – TPC-C

The first section of the webinar was delivered by Eric Hanson.

The first benchmark we tested was TPC-C, which tests transaction throughput against various data sets. This benchmark uses two newer MemSQL capabilities:

  • SELECT FOR UPDATE, added in our MemSQL 6.7 release.
  • Fast synchronous replication and durability for fast synchronous operations, part of our upcoming MemSQL 7.0 release. (The relevant MemSQL 7.0 beta is available.)

To demonstrate what MemSQL can do in production, we disabled rate limiting and used asynchronous durability. This gives a realistic aspect to the results, but it means that they can’t be compared directly to certified TPC-C results.

MemSQL runs transactions fast and is a scalable database, at a close to linear rate.

These results showed high sync replication performance, with excellent transaction rates, and near-linear scaling of performance as additional servers are added. For transaction processing, MemSQL delivers speed, scalability, simplicity, and both serializability and high availability (HA) to whatever extent needed.
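To make the SELECT FOR UPDATE capability mentioned above concrete, here is a minimal sketch of a New-Order-style transaction written in Python. It talks to the database over the MySQL wire protocol (which MemSQL supports) using the pymysql client; the host, credentials, and the stock/orders tables are illustrative assumptions, not the official TPC-C schema or the benchmark driver used in the webinar.

import pymysql

# Connection details below are placeholders, not the benchmark environment.
conn = pymysql.connect(host="127.0.0.1", port=3306, user="app", password="secret",
                       database="tpcc_demo", autocommit=False)

def new_order(item_id, qty):
    """Reserve stock and record an order in a single transaction."""
    try:
        with conn.cursor() as cur:
            # Lock the stock row so concurrent orders serialize on it.
            cur.execute("SELECT quantity FROM stock WHERE item_id = %s FOR UPDATE", (item_id,))
            (available,) = cur.fetchone()
            if available < qty:
                conn.rollback()
                return False
            cur.execute("UPDATE stock SET quantity = quantity - %s WHERE item_id = %s",
                        (qty, item_id))
            cur.execute("INSERT INTO orders (item_id, qty) VALUES (%s, %s)", (item_id, qty))
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise

new_order(42, 3)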

MemSQL and Analytics Performance – TPC-H and TPC-DS

The second section of the webinar was delivered by Nick Kline.

The data warehousing benchmarks use a scale factor of 10TB of data. MemSQL is very unusual in being able to handle both fast transactions, as shown by the TPC-C results, and fast analytics, as shown by these TPC-H and TPC-DS results – on the same data, at the same time.

MemSQL is now being optimized, release to release, in both areas at once. Query optimization is an ongoing effort, with increasingly positive results. Nick described, in some detail, how two queries from the TPC-H benchmark get processed through the query optimizer and executed. The breakdown for one query, TPC-H Query 3, is shown here.

Breaking down a TPC-H query to show how MemSQL avoids slow queries.

The TPC-DS benchmark is an updated and more complex version of the TPC-H benchmark alluded to above. In fact, it’s so challenging that many databases – even those optimized for analytics – can’t run it effectively, or can’t run some of the queries. MemSQL can run all the queries for both TPC-H and TPC-DS, as well as for TPC-C, and all with good results.

For TPC-H, smaller numbers are better. MemSQL was able to achieve excellent results on TPC-H with a relatively moderate hardware budget.

MemSQL gets excellent database benchmarking results against moderate hardware.

Results for TPC-DS were also very good. Because queries on TPC-DS vary greatly in their complexity, query results vary between very short and very long result times. As a result, the geometric mean is commonly used to express the results. We compared MemSQL to several existing published results. Smaller is better.

MemSQL shows itself as a fast database against the somewhat intimidating TPC-DS benchmark.

Q&As for the MemSQL Benchmarks Webinar

The Q&A was shared between Eric and Nick. Also, these Q&As are paraphrased; for the more detailed, verbatim version, view the recorded webinar. Both speakers also referred to our detailed benchmarking blog post.

Q. Does MemSQL get used for these purposes in production?

A. (Hanson) Yes. One example is a wealth management application at a top 10 US bank, running in real-time. Other examples include gaming consoles and IoT implementations in the energy industry.

Q. Should we use MemSQL for data warehousing applications, operational database needs, or both?

A. (Hanson) Our benchmarking results show that MemSQL is excellent across a range of applications. However, MemSQL is truly exceptional for operational analytics, which combines aspects of both. So we find that many of our customers begin their usage of MemSQL in this area, then extend it to aspects of data warehousing on the one hand, transactions on the other, and merged operations.

Q. How do we decide whether to use rowstore or columnstore?

A. (Kline) Rowstore tables fit entirely in memory and are best suited to transactions, though they get used for analytics as well. For rowstore, you have to spec the implementation so it has enough memory for the entire application. Columnstore also does transactions, somewhat more slowly, and is disk-based, though MemSQL still does much of its work in memory. And columnstore is the default choice for analytics at scale. (Also, see our rowstore vs. columnstore blog post. – Ed.)
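As a rough illustration of that choice, the sketch below creates one rowstore and one columnstore table from Python over the MySQL protocol. The connection details and table definitions are made up for the example, and the clustered columnstore key syntax reflects our understanding of MemSQL 6.x DDL, so check the current documentation before relying on it.

import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306, user="app", password="secret",
                       database="demo", autocommit=True)

with conn.cursor() as cur:
    # In-memory rowstore (the default table type): suited to point lookups and transactions.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS accounts_row (
            id BIGINT PRIMARY KEY,
            balance DECIMAL(18,2) NOT NULL
        )
    """)
    # Disk-backed columnstore: suited to large scans and analytical queries.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events_col (
            ts DATETIME NOT NULL,
            account_id BIGINT NOT NULL,
            amount DECIMAL(18,2) NOT NULL,
            KEY (ts) USING CLUSTERED COLUMNSTORE
        )
    """)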

Q. How do you get the performance you do?

A. (Hanson) There’s a lot to say here, but I can mention a few highlights. Our in-memory data tables are very fast. We compile queries to machine code, and we also work against compressed data, without the need to decompress it first – this can cut out 90% of the time that would otherwise be needed to, for instance, scan a record.

We have super high performance for both transactions and analytics against rowstore. For columnstore, we use vectorized query execution. Since the early 2000s, there’s a new approach, in which you process not single rows, but thousands of rows at a time. So for filtering a column, as an example, we do it 4000 rows at a time, in tight loops. Finally, we use single instruction, multiple data (SIMD) instructions as part of parallelizing operations.
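The difference between row-at-a-time and batch-at-a-time (vectorized) filtering can be illustrated outside the database. The NumPy sketch below is only a conceptual analogy of the approach described here, not MemSQL’s actual execution engine; the 4,000-row batch size mirrors the figure mentioned above.

import numpy as np

amounts = np.random.rand(1_000_000) * 100.0

# Row-at-a-time: one comparison per Python-level iteration.
def filter_row_at_a_time(values, threshold):
    out = []
    for v in values:
        if v > threshold:
            out.append(v)
    return out

# Batch-at-a-time: each comparison runs over a whole chunk in a tight loop,
# which is also where SIMD instructions can be applied by the hardware.
def filter_vectorized(values, threshold, batch=4000):
    chunks = []
    for start in range(0, len(values), batch):
        chunk = values[start:start + batch]
        chunks.append(chunk[chunk > threshold])
    return np.concatenate(chunks)

assert len(filter_row_at_a_time(amounts, 50.0)) == len(filter_vectorized(amounts, 50.0))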

Conclusion

To learn more about MemSQL and the improvements in MemSQL 6.8, view the recorded webinar. You can also read the benchmarking blog post and view the benchmarking webinar. Also, you can get started with MemSQL for free today.

10 Areas of Expertise in Data Science


Feed: Featured Blog Posts – Data Science Central.
Author: Divya Singh.

The analytics market is booming, and so is the use of the keyword – Data Science. Professionals from different disciplines are using data in their day-to-day activities and feel the need to master state-of-the-art technology in order to get maximum insights from the data and subsequently help the business grow.

Moreover, there are professionals who want to keep themselves updated with the latest skills such as Machine Learning, Deep Learning, and Data Science, either to elevate their career or to move to a different career altogether. The role of a Data Scientist is regarded as the sexiest job of the 21st century, making it too lucrative for most people to turn down.

However, making a transition to Data Science, or starting a career in it as a fresher, is not an easy task. The supply-demand gap is gradually diminishing as more and more people are willing to master this technology. There is often a misconception among professionals and companies as to what Data Science is, and in many scenarios the term has been misused for various small-scale tasks.

To be a Data Scientist, you need a passion and zeal to play with data, and a desire to make digits and numbers talk. It is a mixture of various things, and there is a plethora of skills one has to master to be called a Full Stack Data Scientist. The list of skills can get overwhelming, given the enormity of its applications and the continuous learning mindset the field of Data Science demands.

In this article, we walk you through ten areas of Data Science which are a key part of a project, and which you need to master to be able to work as a Data Scientist in most big organizations.

  • Data Engineering – To work on any Data Science project, the most important aspect is the data. You need to understand which data to use, how to organize the data, and so on. This manipulation of the data is done by a Data Engineer in a Data Science team. It is a superset of Data Warehousing and Business Intelligence that also brings the concept of big data into the picture.

Building and maintaining a data warehouse is a key skill a Data Engineer must have. Data Engineers prepare the structured and unstructured data to be used by the analytics team for model building, and they build pipelines which extract data from multiple sources and then manipulate it to make it usable.

Python, SQL, Scala, Hadoop, Spark, etc., are some of the skills that a Data Engineer has. They should also understand the concept of ETL (a minimal ETL sketch is shown after this list). Data lakes in Hadoop are one of the key areas of work for a Data Engineer, NoSQL databases are commonly used as part of the data workflows, and the Lambda architecture allows both batch and real-time processing.

Some of the job roles available in the data engineering domain are Database Developer, Data Engineer, etc.

  • Data Mining – It is the process of extracting insights from the data using certain methodologies that help the business make smart decisions. It distinguishes previously unknown patterns and relationships in the data. Through data mining, one can transform the data into various meaningful structures in accordance with the business. The application of data mining depends on the industry: in finance it is used in risk or fraud analytics, while in manufacturing, product safety and quality issues can be analyzed with accurate mining. Some of the techniques in data mining are path analysis, forecasting, clustering, and so on. Business Analyst and Statistician are some of the related jobs in the data mining space.
  • Cloud Computing – A lot of companies these days are migrating their infrastructure from local systems to the cloud, largely because of the ready-made availability of resources and the huge computational power that is not always available in an on-premise system. Cloud computing generally refers to the implementation of platforms for distributed computing. The system requirements are analyzed to ensure seamless integration with present applications. Cloud Architect and Platform Engineer are some of the related jobs.
  • Database Management – Rapidly changing data makes it imperative for companies to ensure accuracy in tracking the data on a regular basis. This detailed data can empower the business to make timely strategic decisions and maintain a systematic workflow. The collected data is used to generate reports and is made available to management in the form of relational databases. The database management system maintains links among the data and also allows newer updates. The structured format of databases helps management look for data in an efficient manner. Data Specialist and Database Administrator are some of the jobs in this area.
  • Business Intelligence – The area of business intelligence refers to finding patterns in the historical data of a business. Business Intelligence analysts find the trends for a data scientist to build predictive models upon. It is about answering not-so-obvious questions; Business Intelligence answers the ‘what’ of a business. Business Intelligence is about creating dashboards and drawing insights from the data. For a BI analyst, it is important to learn data handling and master tools like Tableau, Power BI, SQL, and so on. Additionally, proficiency in Excel is a must in business intelligence.
  • Machine Learning – Machine Learning is the state-of-the-art methodology to make predictions from the data and help the business make better decisions. Once the data is curated by the Data Engineer and analyzed by a Business Intelligence analyst, it is provided to a Machine Learning engineer to build predictive models based on the use case in hand. The field of machine learning is categorized into supervised, unsupervised, and reinforcement learning. The dataset is labeled in supervised learning, unlike in unsupervised learning. To build a model, it is first trained with data to let it identify patterns and learn from them to make predictions on an unknown set of data. The accuracy of the model is determined based on the metric and the KPI that the business decides on beforehand.
  • Deep Learning – Deep Learning is a branch of Machine Learning which uses neural networks to make predictions. Neural networks work in a way loosely inspired by our brain and can build predictive models that go beyond traditional ML systems. Unlike in Machine Learning, no manual feature selection is required in Deep Learning, but huge volumes of data and enormous computational power are needed to run deep learning frameworks. Some of the Deep Learning frameworks are TensorFlow, Keras, and PyTorch.
  • Natural Language Processing – NLP, or Natural Language Processing, is a specialization in Data Science which deals with raw text. Natural language or speech is processed using several NLP libraries, and various hidden insights can be extracted from it. NLP has gained popularity in recent times with the amount of unstructured raw text that is getting generated from a plethora of sources, and the unprecedented information that this natural data carries. Some of the applications of Natural Language Processing are Amazon’s Alexa and Apple’s Siri. Many companies are also using NLP for sentiment analysis, resume parsing, and so on.
  • Data Visualization – Needless to say, it is important to present your insights, either through scripting or with the help of various visualization tools. A lot of Data Science tasks can be addressed with accurate data visualization, as charts and graphs present enough hidden information for the business to take relevant decisions. Often it gets difficult for an organization to build predictive models, so they rely only on visualizing the data for their workflow. Moreover, one needs to understand which graphs or charts to use for a particular business, and keep the visualization simple as well as informative.
  • Domain Expertise – As mentioned earlier, professionals from different disciplines are using data in their business, and the wide range of applications makes it imperative for people to understand the domain in which they are applying their Data Science skills. The domain knowledge could be operations-related, where you leverage the tools to improve business operations focused on financials, logistics, etc. It could also be sector-specific, such as Finance, Healthcare, etc.
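To ground the Data Engineering item above, here is a minimal ETL sketch that uses only the Python standard library: it extracts rows from a CSV file, applies a small transformation, and loads the result into SQLite. The file name, columns, and target table are hypothetical and chosen only for illustration.

import csv
import sqlite3

def extract(path):
    """Read raw rows from a CSV file (a hypothetical sales.csv with date,region,amount columns)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Normalize region names and cast amounts to float, skipping malformed rows."""
    for row in rows:
        try:
            yield (row["date"], row["region"].strip().upper(), float(row["amount"]))
        except (KeyError, ValueError):
            continue

def load(records, db_path="warehouse.db"):
    """Create the target table if needed and bulk-insert the cleaned records."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales (date TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("sales.csv")))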

Conclusion –

Data Science is a broad field with a multitude of skills and technologies that need to be mastered. It is a life-long learning journey, and with the frequent arrival of new technologies, one has to keep updating constantly.

It can often be challenging to keep up with such frequent changes. It helps to learn all of these skills and to master at least one of them. In a big corporation, a Data Science team comprises people assigned different roles such as data engineering, modeling, and so on. Thus focusing on one particular area would give you an edge over others in finding a role within a Data Science team in an organization.

Data Scientist is the most sought-after job of this decade, and it will continue to be so in the years to come. Now is the right time to enter this field, and Dimensionless has several blogs and training courses to get you started with Data Science.

Follow this link, if you are looking to learn more about data science online!


The 5-Minute Interview: Wouter Crooy, Technical Lead at albumprinter(Albelli)


Feed: Neo4j Graph Database Platform.
Author: Rachel Howard.
“Neo4j stood out of the NoSQL movement as having all the advantages of traditional relational databases,” said Wouter Crooy, Technical Lead at albumprinter.

While relational databases work well for some use cases, albumprinter relies on a polyglot persistence architecture to provide customers with a way to easily manage and create products from their personal photo collections.

In this week’s five-minute interview (conducted at GraphConnect in San Francisco), we discuss how the albumprinter team developed their graph recommendation engine — as well as exciting plans for the company’s future.

Talk to us about how you use Neo4j at albumprinter.

Wouter Crooy: We are a photo products company that creates moments that last, with products ranging from wall decor to cars. We saw a gap in the market for organizing photos and decided to start a company that would help our customers find and sort their photos and make products. We chose Neo4j because with a graph database, we can create relationships between photos. We do that based on the metadata we store in the Neo4j database.
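As an illustration of that idea (and not albumprinter’s actual data model), the short sketch below uses the official Neo4j Python driver to relate two photos through a shared location pulled from their metadata. The labels, properties, and Cypher are assumptions made up for the example.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_photos_by_location(photo_a, photo_b, location):
    """Create (or reuse) two Photo nodes and connect them via a shared Location node."""
    query = (
        "MERGE (a:Photo {id: $a}) "
        "MERGE (b:Photo {id: $b}) "
        "MERGE (l:Location {name: $loc}) "
        "MERGE (a)-[:TAKEN_AT]->(l) "
        "MERGE (b)-[:TAKEN_AT]->(l)"
    )
    with driver.session() as session:
        session.run(query, a=photo_a, b=photo_b, loc=location)

link_photos_by_location("p-001", "p-002", "Amsterdam")
driver.close()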

What made Neo4j stand out?

Crooy: Neo4j stood out of the NoSQL movement as having all the advantages of traditional relational databases. It’s a reliable asset, we can continue working in transactions, and it combines the benefits of both NoSQL and relational databases. It also provides us with a database that is tight and reliable, which is especially important as a company that works with customer data.

Catch this week’s 5-Minute Interview with Wouter Crooy, Technical Lead at albumprinter

What are some of the most surprising results you’ve had working with Neo4j?

Crooy: We rely on polyglot persistence for our architecture, so we have multiple databases and use the one most suitable for each particular use case. Our scalability is cloud-based, and the ability to backup data is incredibly important for us.

One of the most surprising things we found along our journey was that we had to abandon the traditional relational model of storing data in tables and records, which was helped by the fact that we had a large domain model to store in the database. On our scale, we needed to relearn the way to normalize data. In a traditional database, you have multiple tables that you have to constantly debug. Now we need to debug the nodes by making sure they have fewer properties, and we had to learn how to “walk the graph.”

Knowing everything you know now, if you had to go back in time and start over with Neo4j, is there anything you would do differently?

Crooy: We would go back and optimize our data model. On the performance side, we were on the edge of what was possible, and there are definitely things to improve. So even though what we did with Neo4j was ultimately what we needed, there’s always stuff to learn that could have made our product better.

Anything else you’d like to add?

Crooy: It has been an exciting journey, and thanks to Neo4j, we had a short time to market when it came to creating all of these features. We are really happy with it.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com

Want to know how to use graphs in your industry?
Download this white paper, The Top 5 Use Cases of Graph Databases, and discover how to tap into the power of graphs for the connected enterprise.

Read the White Paper


Enterprise Database Solution – Maximizing Return for the Lowest Cost


Feed: Planet MySQL
;
Author: MySQL Performance Blog
;


It used to be easy: a company developed a new application, chose a database solution, launched the new application and then tuned the chosen database solution. A team of DBAs looked at the infrastructure as well as the workload and made changes (or suggestions) as needed. The application then stayed in production for years and small tweaks were made as needed.

Those days are long gone.

As technology has evolved, so has the workflow and deployment strategy within the large enterprise.  Large, monolithic applications are being split into several microservices, generally decoupled but still working together and somewhat interdependent. Waterfall deployment strategies are replaced with agile methodology and continuous code deployment. Tuning and maintaining large installations of physical hardware has become less of the focus with the advent of virtualization, containerization, and orchestrated deployments.

Despite all of these changes and radical shifts in the market, one question for executives and management has remained constant: what approach should I use to maximize my return and give me the most productive environment for the lowest cost? As any good consultant will tell you, “it depends”.  Even with all the advances in technology, frameworks, and deployment strategies, there is still no silver bullet that achieves everything you need within your organization (while also preparing your meals and walking your dog).

Choosing an Enterprise Database Solution

In this post, we’ll discuss some of the paths you can take as a guide on your journey of choosing an enterprise database solution. It’s not meant to provide technical advice or suggest a “best option.”

Before going into some of the options, let’s put a few assumptions out there:

  • Your organization wants to use the right database solution for the job (or a few limited solutions)
  • You DO NOT want to rack new physical servers every time you need a new server or expect growth
  • Your application teams far outnumber your operations and database team (in terms of number of teams and overall members)
  • The question of “what does your application do” is more accurately replaced with several variations of “what does this particular application do”

Now that we have that out of the way, let’s start with buzzword number one: the cloud. While it is used all the time, there are a few different meanings. Originally (and most commonly), the cloud is referring to the “public” cloud — entities like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud.  When it first came to fruition, the most common barrier to organizations moving to the cloud was security. As more and more PII data is stored by large enterprises, the inherent fear of a breach in the public cloud led many companies to shy away. Although this is much less of a concern given all the advances in security, there are some instances where an organization might still believe that storing data in a “public” datacenter is a hard no. If this is your organization, feel free to skip ahead to the on-premise discussion below.

Public Cloud

Assuming that you can engineer proper security in the public cloud of your choosing, some of the main benefits of outsourcing your infrastructure quickly bubble to the top:

  • Elasticity
  • Flexibility
  • Agility
  • Cost

Elasticity

In many circumstances, you need increased capacity now, but only for a limited time. Does this scenario sound familiar? The beauty of the public cloud is that you generally only pay for what you are using. Looking at things from a long-term cost perspective, if you only need two times your capacity for two weeks out the year, why should you pay for half of your infrastructure to sit idle for the other fifty weeks annually?

Since you don’t have to actually maintain any physical gear in the public cloud, you have the ability to add/remove capacity as needed. There is no need to plan for or provision for additional hardware — and everything that comes with that (e.g., maintaining the cooling systems for double the number of data center servers, increased power costs, expanded physical space, etc.).

Flexibility / Agility

Most public clouds offer more than simply instant access to additional compute instances. There are managed services for several common use cases: relational databases, NoSQL databases, big data stores, message queues, and the list goes on. This flexibility is evident in using various managed services as glue to hold other managed services together.

In traditional environments, you may identify the need for a technology (think message queue), but opt against it due to the complexity of needing to actually manage it and use a less efficient alternative (a relational database for example). With these components readily available in most public clouds, your organization has the flexibility to use the correct technology for each use case without the burden of maintaining it.

Along with the flexibility of plugging in the appropriate technology, you greatly increase the speed at which this can be done. There is much less need from an infrastructure standpoint to plan for supporting a new technology. With the click of a button, the new technology is ready to go in your stack.  In an agile work environment, having an agile platform to accompany the methodology is very important.

Cost

While the above benefits are all really great, the bottom line is always (the most) important. Depending on how you determine the overall cost of your infrastructure (i.e., hardware only, or do you include operations staff, building costs, etc.) you can see cost savings. One of the big challenges with running physical gear is the initial cost. If I want to run a rack of 20 servers, I have to buy 20 servers, rack them up and turn them on. My ongoing operational cost is likely going to be less than in the cloud (remember, in the cloud you are paying as you use it), but I also need to spread the initial cost over time.

While an overall cost analysis is well outside the scope of this document, you can see how determining cost savings using the public cloud vs. an on-premise solution can be challenging. With all else being equal, you will generally have a more predictable monthly cost when using the public cloud and often can get volume (or reserved) discounts. For example, AWS provides a “TCO Calculator” to estimate how you could save on cost by switching to the public cloud: https://aws.amazon.com/tco-calculator/.
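As a back-of-the-envelope illustration of that comparison, the short script below contrasts three years of on-premise capital plus operating spend with a pay-as-you-go cloud bill that includes a two-week burst each year. Every figure is an assumption invented for the example, not a price quote from any vendor.

# All figures below are illustrative assumptions for a 3-year comparison.
YEARS = 3

# On-premise: buy for peak capacity up front, plus ongoing operating cost.
servers_at_peak = 40
server_price = 6000              # purchase cost per server
opex_per_server_per_year = 1200  # power, cooling, space, support
onprem_total = (servers_at_peak * server_price
                + servers_at_peak * opex_per_server_per_year * YEARS)

# Cloud: pay per instance-hour; a 20-instance baseline, doubled for 2 weeks a year.
hourly_rate = 0.50
baseline_instances = 20
burst_instances = 20
baseline_hours = 24 * 365 * YEARS
burst_hours = 24 * 14 * YEARS
cloud_total = hourly_rate * (baseline_instances * baseline_hours
                             + burst_instances * burst_hours)

print(f"On-premise over 3 years: ${onprem_total:,.0f}")
print(f"Cloud over 3 years:      ${cloud_total:,.0f}")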

On-Premise

So the powers that be at your company have drawn a line in the sand and said “no” to using the public cloud. Does that mean that each time an application team needs a database, your operations team is racking a server and setting it up? It very well could, but let’s explore a few of the options available to your infrastructure team.

Dedicated Hardware

While this option can seem outdated, there are several benefits to provisioning bare metal machines in your data center:

  • Complete control over the machine
    • OS tuning
    • Hardware choices
    • Physical control
  • Easy to make different “classes” of machine
    • Spinning disks for DR slaves
    • SSD for slaves
    • Flash storage for masters
    • Etc
  • Easier troubleshooting
    • Less of a need to determine which “layer” is having problems
  • Less overhead for virtualization/containerization
  • No “extra servers” needed for managing the infrastructure

In a relatively static environment, this is still a great choice as you have full access and minimal layers to deal with. If you see disk errors, you don’t have to decide which “layer” is actually having problems – it is likely the disk. While this is nice, it can be cumbersome and a burden on your operations staff when there are always new databases being added (for microservices or scaling).

In this model, each server is assumed to be a static resource. Generally, you wouldn’t provision a bare metal machine with an OS and database and then wipe it and start over repeatedly. Rather, this model of deployment is best suited to an established application running a predictable workload, where scaling is slow and over time.

A major downside to this approach is resource utilization. Normally, you wouldn’t want to only use half of everything that you purchase. When dealing with bare metal machines, you generally don’t want to have everything running at maximum capacity all the time so that you can handle spikes in traffic.  When provisioning bare metal machines, this means you either have to pay for all of your potential resources and then watch most of them sit idle much of the time or risk outages while continuously running at the limits.

Virtualization/Containers

Right up there with “the cloud”, another buzzword these days is “containers”. At a high level, containers and virtualization are similar in that they both allow you to use part of a larger physical server to emulate a smaller server. This gives operations teams the ability to create “images” that can be used to quickly provision “servers” on larger bare metal machines.

While this does add a new layer to your stack, and can potentially introduce some additional complexity in tuning and/or troubleshooting, two major problems with bare metal provisioning are addressed:

  • Flexibility
  • Resource utilization

In terms of flexibility, operations teams are able to have a collection of standard images for various systems, such as application servers or database servers, and quickly spin them up on readily waiting hardware. This makes it much easier when an application team says “we need a new database for this service and will need four application servers with it.”  Rather than racking up and setting up five physical machines and installing the OS along with various packages, the operations team simply starts five virtual machines (or containers for those of you “containerites” out there) and hands them off.

This also helps with resource utilization. Rather than setting one application server up on a physical machine and keeping it under 50% utilization all the time, you are able to launch multiple VMs on this machine, each just using a portion. When the physical machine reaches maximum capacity, you can move an image to a new physical machine. This process gets rinsed and repeated as traffic patterns change and resource demands shift. It decreases some of the pain that comes from watching bare machines sit idle.

Private Cloud

Now, let’s put it all together and talk about creating a private cloud. It’s the best of both worlds, right?  All the flexibility and elasticity of the public cloud, but in your own data center where you can retain full control of everything. In this scenario, an organization is generally doing the following:

  • Managing a data center of generic, physical machines
  • Leveraging virtualization and/or containerization to quickly launch/destroy server images
  • Using an orchestration layer to manage all of the VMs/containers

This is a great fit for organizations that already have made an investment in a large physical infrastructure. You likely already have hundreds of servers at your disposal, so why not get the most utilization you can out of them and make your infrastructure much more dynamic?

Consider this…

While this sounds amazing (and quite often IS the best fit), here’s what to consider.  When dealing with a large internal cloud, you will need people experienced in managing this sort of infrastructure. Even though application teams now just hit a button to launch a database and application server, the cloud is still backed by a traditional data center with bare metal servers. An operations team is still a very needed entity — even though they may not be your traditional “DBA” or “ops guy”.

Also, the complexity of managing (and definitely troubleshooting) an environment such as this generally increases by an order of magnitude. Generic questions like “why is my application running slow?” used to be easier to answer: you check the application server and the database server, look at some metrics, and can generally pinpoint what is happening. In a large private cloud, now you’ll need to look at:

  • Application/query layer
  • Orchestration layer
  • Virtualization / container layer
  • Physical layer

It is not to say it isn’t worth it, but managing an internal cloud is not a trivial task and much thought needs to be put in.

How Can Percona Help?

Having been in the open source database space for years, Percona has seen and worked on just about every possible MySQL deployment possible. We also focus on picking the proper tool for the job and will meet your organization where you are. Running Postgres on bare metal servers? We can help.  Serving your application off of EC2 instances backed by an RDS database? No problem. MongoDB on Kubernetes in your private cloud? Check.

We can also work with your organization to help you choose the best path to follow. We love open source databases and the flexibility that they can provide. Our team has experience designing and deploying architectures ranging from a single database cloud server to hundreds of bare metal machines spanning across multiple data centers. With that sort of experience, we can help your organization with an enterprise database solution too!

Photo by Carl Nenzen Loven on Unsplash

ScaleGrid DBaaS Expands MySQL Hosting Services Through AWS Cloud


Feed: Planet MySQL
;
Author: ScaleGrid.io
;

PALO ALTO, Calif., June 6, 2019 – ScaleGrid, the Database-as-a-Service (DBaaS) leader in the SQL and NoSQL space, has announced the expansion of their fully managed MySQL Hosting services to support Amazon Web Services (AWS) cloud. The platform allows MySQL AWS administrators to automate their time-consuming database operations in the cloud and improve their performance with high availability, disaster recovery, polyglot persistence, and advanced monitoring and analytics.

MySQL Slow Query Analyzer - ScaleGrid DBaaS

Over the years, migrating data to the cloud has become a top priority for organizations looking to modernize their infrastructure for improved security, performance, and agility, closely followed by the trending shift from commercial database management systems to open source databases. It comes as no surprise that the #1 open source database and the most popular cloud provider in the world are a natural fit for this transition.

ScaleGrid’s solution brings a unique twist to the table through their Bring Your Own Cloud (BYOC) plans which allow you to host MySQL through your own AWS account. While all other DBaaS platforms require you to host through their service, ScaleGrid encourages users to host in the safety of their own accounts so they can leverage advanced security features like AWS Virtual Private Clouds (VPC) to protect their data from the internet, and Security Groups to lock down access to their servers. Additionally, organizations are able to leverage Reserved Instances through ScaleGrid’s BYOC plan, allowing them to save up to 60% on their long-term database hosting costs.


“Having to give up administrative control has been one of the biggest roadblocks to enterprise DBaaS adoption,” says Dharshan Rangegowda, Founder and CEO of ScaleGrid. “At ScaleGrid, we keep power in the hands of our users with full MySQL superuser admin privileges and SSH access to your machines so you don’t have to sacrifice control to leverage a managed database service.”

ScaleGrid first added support for MySQL on Azure back in November, 2018, joining their fully managed MongoDB Hosting and Redis Hosting open source database family of services, and PostgreSQL Hosting is expected in June of 2019. With the addition of MySQL on AWS, ScaleGrid customers can now deploy this SQL database across five North American regions, one South American, three European, and six Asia Pacific AWS regions.

Start a free MySQL trial to see how ScaleGrid can help you optimize your deployments.

Learn More About MySQL Hosting

MySQL Webcasts On Demand en Español & English


Feed: Planet MySQL
;
Author: Keith Hollman
;

In case it hasn't been seen or promoted enough, I wanted to share the list of webcasts in Spanish that are already available under On Demand webinars in the News & Events section of mysql.com:

https://www.mysql.com/news-and-events/on-demand-webinars/#es-20-0

Each one runs about an hour; here are some examples:

MySQL InnoDB Cluster: Una introducción y Demo

MySQL, NoSQL, JSON, JS, Python: Document Store. (+demo)

MySQL 8: Nuevas Funcionalidades

MySQL Enterprise Monitor + demo

MySQL Enterprise Backup: Introducción

Introducción al DBA Oracle: MySQL & Oracle

I hope you find them interesting!


An Overview of PostgreSQL to MySQL Cross Replication


Feed: Planet MySQL
;
Author: Severalnines
;

This blog gives an overview of cross replication between PostgreSQL and MySQL, and then discusses the methods of configuring cross replication between the two database servers. Traditionally, the databases involved in a cross replication setup are called heterogeneous databases, and this is a good approach for moving away from one RDBMS server to another.

Both PostgreSQL and MySQL databases are conventionally RDBMS databases but they also offer NoSQL capability with added extensions to have the best of both worlds. This article focuses on the discussion of replication between PostgreSQL and MySQL from an RDBMS perspective.

An exhaustive explanation of the internals of replication is not within the purview of this blog; however, some foundational elements will be discussed to give the audience an understanding of how replication is configured between database servers, along with its advantages, limitations, and some known use cases.

In general replication between two identical database servers is achieved either in binary mode or query mode between a master node (otherwise called publisher, primary or active) and a slave node (subscriber, standby or passive). The aim of replication is to provide a real time copy of the master database on the slave side, where the data is transferred from master to slave, thereby forming an active-passive setup because the replication is only configured to occur one way. On the other hand, replication between two databases can be configured both ways so the data can also be transferred from slave back to master, establishing an active-active configuration. All of this can be configured between two or more identical database servers which may also include a cascading replication. The configuration of active-active or active-passive really depends on the business need, availability of such features within the native configuration or utilizing external solutions to configure and applicable trade-offs.

The above mentioned configuration can be accomplished with diverse database servers, wherein a database server can be configured to accept replicated data from another completely different database server and still maintain real time snapshot of the data being replicated. Both MySQL and PostgreSQL database servers offer most of the configurations discussed above either in their own nativity or with the help of third party extensions including binary log method, disk block method, statement based and row based methods.

The requirement to configure cross replication between MySQL and PostgreSQL usually arises from a one-time migration effort to move from one database server to the other. Because the two databases use different protocols, they cannot talk to each other directly. In order to achieve that communication flow, there is an external open source tool such as pg_chameleon.

Background of pg_chameleon

pg_chameleon is a MySQL to PostgreSQL replication system developed in Python 3. It uses an open source library called mysql-replication, which is also developed in Python. The functionality involves pulling row images of MySQL tables and storing them as JSONB objects in a PostgreSQL database, where they are decoded by a pl/pgsql function that replays those changes against the PostgreSQL database.
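To show the kind of machinery pg_chameleon builds on, here is a minimal sketch that uses the same mysql-replication (pymysqlreplication) library to read row images from the MySQL binlog and turn them into JSON documents of the sort that would be stored as JSONB. The connection settings reuse the demo values from this post, and printing instead of writing to PostgreSQL is a simplification; this is not pg_chameleon's actual code.

import json
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)

# MySQL must run with binlog_format=ROW for row images to be available.
mysql_settings = {"host": "192.168.56.102", "port": 3306,
                  "user": "usr_replica", "passwd": "pass123"}

stream = BinLogStreamReader(
    connection_settings=mysql_settings,
    server_id=100,   # must be unique among replicas of this master
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        payload = {
            "schema": event.schema,
            "table": event.table,
            "event": type(event).__name__,
            "row": row,  # holds "values" (insert/delete) or "before_values"/"after_values" (update)
        }
        # pg_chameleon stores documents like this as JSONB and replays them with pl/pgsql;
        # here we simply print the JSON.
        print(json.dumps(payload, default=str))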

Features of pg_chameleon

  • Multiple MySQL schemas from the same cluster can be replicated to a single target PostgreSQL database, forming a many-to-one replication setup
  • The source and target schema names can be non-identical
  • Replication data can be pulled from MySQL cascading replica
  • Tables that fail to replicate or generate errors are excluded
  • Each replication functionality is managed with the help of daemons
  • Controlled with the help of parameters and configuration files based on YAML construct

Demo

Host                          vm1                               vm2
OS version                    CentOS Linux release 7.6 x86_64   CentOS Linux release 7.5 x86_64
Database server and version   MySQL 5.7.26                      PostgreSQL 10.5
Database port                 3306                              5433
IP address                    192.168.56.102                    192.168.56.106

To begin with, prepare the setup with all the prerequisites needed to install pg_chameleon. In this demo, Python 3.6.8 is installed, and a virtual environment is created and activated for use.

$> wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8.tar.xz
$> tar -xJf Python-3.6.8.tar.xz
$> cd Python-3.6.8
$> ./configure --enable-optimizations
$> make altinstall

Following a successful installation of Python 3.6, further requirements are met, such as creating and activating a virtual environment. In addition, the pip module is upgraded to the latest version and used to install pg_chameleon. In the commands below, pg_chameleon 2.0.9 was deliberately installed, whereas the latest version is 2.0.10. This is done in order to avoid any newly introduced bugs in the updated version.

$> python3.6 -m venv venv
$> source venv/bin/activate
(venv) $> pip install pip --upgrade
(venv) $> pip install pg_chameleon==2.0.9

The next step is to invoke pg_chameleon (chameleon is the command) with the set_configuration_files argument to enable pg_chameleon to create default directories and configuration files.

(venv) $> chameleon set_configuration_files
creating directory /root/.pg_chameleon
creating directory /root/.pg_chameleon/configuration/
creating directory /root/.pg_chameleon/logs/
creating directory /root/.pg_chameleon/pid/
copying configuration  example in /root/.pg_chameleon/configuration//config-example.yml

Now, create a copy of config-example.yml as default.yml to make it the default configuration file. A sample configuration file used for this demo is provided below.

$> cat default.yml
---
#global settings
pid_dir: '~/.pg_chameleon/pid/'
log_dir: '~/.pg_chameleon/logs/'
log_dest: file
log_level: info
log_days_keep: 10
rollbar_key: ''
rollbar_env: ''

# type_override allows the user to override the default type conversion into a different one.
type_override:
  "tinyint(1)":
    override_to: boolean
    override_tables:
      - "*"

#postgres  destination connection
pg_conn:
  host: "192.168.56.106"
  port: "5433"
  user: "usr_replica"
  password: "pass123"
  database: "db_replica"
  charset: "utf8"

sources:
  mysql:
    db_conn:
      host: "192.168.56.102"
      port: "3306"
      user: "usr_replica"
      password: "pass123"
      charset: 'utf8'
      connect_timeout: 10
    schema_mappings:
      world_x: pgworld_x
    limit_tables:
#      - delphis_mediterranea.foo
    skip_tables:
#      - delphis_mediterranea.bar
    grant_select_to:
      - usr_readonly
    lock_timeout: "120s"
    my_server_id: 100
    replica_batch_size: 10000
    replay_max_rows: 10000
    batch_retention: '1 day'
    copy_max_memory: "300M"
    copy_mode: 'file'
    out_dir: /tmp
    sleep_loop: 1
    on_error_replay: continue
    on_error_read: continue
    auto_maintenance: "disabled"
    gtid_enable: No
    type: mysql
    skip_events:
      insert:
        - delphis_mediterranea.foo #skips inserts on the table delphis_mediterranea.foo
      delete:
        - delphis_mediterranea #skips deletes on schema delphis_mediterranea
      update:

The configuration file used in this demo is the sample file that comes with pg_chameleon with minor edits to suit the source and destination environments, and a summary of different sections of the configuration file follows.

The default.yml configuration file has a “global settings” section that control details such as lock file location, logging locations and retention period, etc. The section that follows next is the “type override” section which is a set of rules to override types during replication. A sample type override rule is used by default which converts a tinyint(1) to a boolean value. The next section is the destination database connection details section which in our case is a PostgreSQL database, denoted by “pg_conn”. The final section is the source section which has all the details of source database connection settings, schema mapping between source and destination, any tables to skip including timeout, memory and batch size settings. Notice the “sources” denoting that there can be multiple sources to a single destination to form a many-to-one replication setup.

A “world_x” database is used in this demo; it is a sample database with 4 tables containing sample rows that the MySQL community offers for demo purposes, and it can be downloaded from here. The sample database comes as a tar and compressed archive along with instructions to create it and import rows into it.

A dedicated user named usr_replica is created in both the MySQL and PostgreSQL databases; on MySQL it is further granted additional privileges to have read access to all the tables being replicated.

mysql> CREATE USER usr_replica ;
mysql> SET PASSWORD FOR usr_replica='pass123';
mysql> GRANT ALL ON world_x.* TO 'usr_replica';
mysql> GRANT RELOAD ON *.* to 'usr_replica';
mysql> GRANT REPLICATION CLIENT ON *.* to 'usr_replica';
mysql> GRANT REPLICATION SLAVE ON *.* to 'usr_replica';
mysql> FLUSH PRIVILEGES;

A database named “db_replica” is created on the PostgreSQL side to accept the changes from the MySQL database. The “usr_replica” user in PostgreSQL is automatically configured as the owner of two schemas, “pgworld_x” and “sch_chameleon”, which contain the actual replicated tables and the replication catalog tables respectively. This automatic configuration is done by the create_replica_schema argument, indicated further below.

postgres=# CREATE USER usr_replica WITH PASSWORD 'pass123';
CREATE ROLE
postgres=# CREATE DATABASE db_replica WITH OWNER usr_replica;
CREATE DATABASE

The MySQL database is configured with a few parameter changes in order to prepare it for replication, as shown below, and it requires a database server restart for the changes to take effect.

$> vi /etc/my.cnf
binlog_format= ROW
binlog_row_image=FULL
log-bin = mysql-bin
server-id = 1

At this point, it is significant to test the connectivity to both the database servers to ensure there are no issues when pg_chameleon commands are executed.

On the PostgreSQL node:

$> mysql -u usr_replica -Ap'pass123' -h 192.168.56.102 -D world_x 

On the MySQL node:

$> psql -p 5433 -U usr_replica -h 192.168.56.106 db_replica

The next three pg_chameleon (chameleon) commands set up the environment, add a source, and initialize a replica. The “create_replica_schema” argument of pg_chameleon creates the default schema (sch_chameleon) and the replication schema (pgworld_x) in the PostgreSQL database, as has already been discussed. The “add_source” argument adds the source database to the configuration by reading the configuration file (default.yml), which in this case is “mysql”, while “init_replica” initializes the configuration based on the settings of the configuration file.

$> chameleon create_replica_schema --debug
$> chameleon add_source --config default --source mysql --debug
$> chameleon init_replica --config default --source mysql --debug

The output of the above three commands is self-explanatory, indicating the success of each command with a clear output message. Any failures or syntax errors are reported in simple, plain messages, suggesting and prompting corrective actions.

The final step is to start the replication with “start_replica”, the success of which is indicated by an output hint as shown below.

$> chameleon start_replica --config default --source mysql 
output: Starting the replica process for source mysql

The status of replication can be queried with the “show_status” argument, while errors can be viewed with the “show_errors” argument.

$> chameleon show_status --source mysql  
OUTPUT: 
  Source id  Source name    Type    Status    Consistent    Read lag    Last read    Replay lag    Last replay
-----------  -------------  ------  --------  ------------  ----------  -----------  ------------  -------------
          1  mysql          mysql   running   No            N/A                      N/A

== Schema mappings ==
Origin schema    Destination schema
---------------  --------------------
world_x          pgworld_x

== Replica status ==
---------------------  ---
Tables not replicated  0
Tables replicated      4
All tables             4
Last maintenance       N/A
Next maintenance       N/A
Replayed rows
Replayed DDL
Skipped rows
---------------------  ---
$> chameleon show_errors --config default 
output: There are no errors in the log

As discussed earlier, each replication function is managed with the help of daemons, which can be viewed by querying the process table using the Linux “ps” command, exhibited below.

$>  ps -ef|grep chameleon
root       763     1  0 19:20 ?        00:00:00 /u01/media/mysql_samp_dbs/world_x-db/venv/bin/python3.6 /u01/media/mysq l_samp_dbs/world_x-db/venv/bin/chameleon start_replica --config default --source mysql
root       764   763  0 19:20 ?        00:00:01 /u01/media/mysql_samp_dbs/world_x-db/venv/bin/python3.6 /u01/media/mysq l_samp_dbs/world_x-db/venv/bin/chameleon start_replica --config default --source mysql
root       765   763  0 19:20 ?        00:00:00 /u01/media/mysql_samp_dbs/world_x-db/venv/bin/python3.6 /u01/media/mysq l_samp_dbs/world_x-db/venv/bin/chameleon start_replica --config default --source mysql

No replication setup is complete until it is put to the “real-time apply” test, which has been simulated as below. It involves creating a table and inserting a couple of records in the MySQL database; subsequently, the “sync_tables” argument of pg_chameleon is invoked to update the daemons to replicate the table, along with its records, to the PostgreSQL database.

mysql> create table t1 (n1 int primary key, n2 varchar(10));
Query OK, 0 rows affected (0.01 sec)
mysql> insert into t1 values (1,'one');
Query OK, 1 row affected (0.00 sec)
mysql> insert into t1 values (2,'two');
Query OK, 1 row affected (0.00 sec)
$> chameleon sync_tables --tables world_x.t1 --config default --source mysql
Sync tables process for source mysql started.

The test is confirmed by querying the table from PostgreSQL database to reflect the rows.

$> psql -p 5433 -U usr_replica -d db_replica -c "select * from pgworld_x.t1";
 n1 |  n2
----+-------
  1 | one
  2 | two

If it is a migration project then the following pg_chameleon commands will mark the end of the migration effort. The commands should be executed after it is confirmed that rows of all the target tables have been replicated across, and the result will be a cleanly migrated PostgreSQL database without any references to the source database or replication schema (sch_chameleon).

$> chameleon stop_replica --config default --source mysql 
$> chameleon detach_replica --config default --source mysql --debug

Optionally the following commands will drop the source configuration and replication schema.

$> chameleon drop_source --config default --source mysql --debug
$> chameleon drop_replica_schema --config default --source mysql --debug

Pros of Using pg_chameleon

  • Simple to set up and less complicated configuration
  • Painless troubleshooting and anomaly detection with easy to understand error output
  • Additional adhoc tables can be added to the replication after initialization, without altering any other configuration
  • Multiple sources can be configured for a single destination database, which is useful in consolidation projects to merge data from one or more MySQL databases into a single PostgreSQL database
  • Selected tables can be skipped from being replicated

Cons of Using pg_chameleon

  • Only supported from MySQL 5.5 onwards as the origin database and PostgreSQL 9.5 onwards as the destination database
  • Requires every table to have a primary or unique key; otherwise, the tables get initialized during the init_replica process but they will fail to replicate
  • One-way replication only, i.e., MySQL to PostgreSQL, thereby limiting its use to an active-passive setup
  • The source database can only be a MySQL database while support for PostgreSQL database as source is experimental with further limitations (click here to learn more)

pg_chameleon Summary

The replication approach offered by pg_chameleon is well suited to a database migration from MySQL to PostgreSQL. However, the significant limitation of one-way replication can discourage database professionals from adopting it for anything other than migration. This drawback of unidirectional replication can be addressed using yet another open source tool called SymmetricDS.

In order to study the utility more in detail, please refer to the official documentation here. The command line reference can be obtained from here.


An Overview of SymmetricDS

SymmetricDS is an open source tool that is capable of replicating any database to any other database, from the popular list of database servers such as Oracle, MongoDB, PostgreSQL, MySQL, SQL Server, MariaDB, DB2, Sybase, Greenplum, Informix, H2, Firebird, and cloud-based database instances such as Redshift and Azure. Some of the offerings include database and file synchronization, multi-master replication, filtered synchronization, and transformation. The tool is developed using Java and requires a standard edition (version 8.0 or above) of either the JRE or JDK. The functionality involves data changes being captured by triggers at the source database and routed to the participating destination database as outgoing batches.

Features of SymmetricDS

  • Platform independent, which means two or more dissimilar databases can communicate with each other, any database to any other database
  • Relational databases achieve synchronization using change data capture while file system based systems utilize file synchronization
  • Bi-directional replication using Push and Pull method, which is accomplished based on set rules
  • Data transfer can also occur over secure and low bandwidth networks
  • Automatic recovery during the resumption of a crashed node and automatic conflict resolution
  • Cloud ready and contains powerful extension APIs

Demo

SymmetricDS can be configured in one of the two options:

  • A master (parent) node that acts as a centralized intermediary coordinating data replication between two slave (child) nodes, in which the communication between the two child nodes can only occur via the parent.
  • An active node (node1) can replicate to and from another active node (node2) without any intermediary.

In both the options, the communication between the nodes happens via “Push” and “Pull” events. In this demo, an active-active configuration between two nodes will be explained. The full architecture can be exhaustive, so the readers are encouraged to check the user guide available here to learn more about the internals of SymmetricDS.

Installing SymmetricDS is as simple as downloading the open source version of the zip file from here and extracting it to a convenient location. The install location and version of SymmetricDS used in this demo are listed in the table below, along with other details pertaining to database versions, Linux versions, IP addresses, and communication ports for both the participating nodes.

Host                           vm1                                  vm2
OS version                     CentOS Linux release 7.6 x86_64      CentOS Linux release 7.6 x86_64
Database server version        MySQL 5.7.26                         PostgreSQL 10.5
Database port                  3306                                 5832
IP address                     192.168.1.107                        192.168.1.112
SymmetricDS version            SymmetricDS 3.9                      SymmetricDS 3.9
SymmetricDS install location   /usr/local/symmetric-server-3.9.20   /usr/local/symmetric-server-3.9.20
SymmetricDS node name          corp-000                             store-001

The install home in this case is “/usr/local/symmetric-server-3.9.20”, which will be the home directory of SymmetricDS and contains various other sub-directories and files. Two of the sub-directories that are of importance now are “samples” and “engines”. The samples directory contains sample node properties configuration files in addition to sample SQL scripts to kick-start a quick demo.

The following three node properties configuration files can be seen in the “samples” directory with names indicating the nature of node in a given setup.

corp-000.properties
store-001.properties
store-002.properties

As SymmetricDS comes with all the necessary configuration files to support a basic 3-node setup (option 1), it is convenient to use the same configuration files for a 2-node setup (option 2) as well. The intended configuration file is copied from the “samples” directory to the “engines” directory on host vm1, and it looks as shown below.

$> cat engines/corp-000.properties
engine.name=corp-000
db.driver=com.mysql.jdbc.Driver
db.url=jdbc:mysql://192.168.1.107:3306/replica_db?autoReconnect=true&useSSL=false
db.user=root
db.password=admin123
registration.url=
sync.url=http://192.168.1.107:31415/sync/corp-000
group.id=corp
external.id=000

In the SymmetricDS configuration this node is named “corp-000”, and its database connection is handled by the MySQL JDBC driver using the connection string stated above, together with the login credentials. The database to connect to is “replica_db”, and its tables will be created when the sample schema is created. The “sync.url” denotes the location at which this node can be contacted for synchronization.

Node 2 on host vm2 is configured as “store-001”, with the rest of the details as configured in its properties file, shown below. The “store-001” node runs a PostgreSQL database, with “pgdb_replica” as the database for replication. The “registration.url” enables host “vm2” to contact host “vm1” and pull configuration details.

$> cat engines/store-001.properties
engine.name=store-001
db.driver=org.postgresql.Driver
db.url=jdbc:postgresql://192.168.1.112:5832/pgdb_replica
db.user=postgres
db.password=admin123
registration.url=http://192.168.1.107:31415/sync/corp-000
group.id=store
external.id=001

The pre-configured default demo of SymmetricDS contains settings to set up bi-directional replication between two database servers (two nodes). The steps below are executed on host vm1 (corp-000) and create a sample schema of four tables. Executing “create-sym-tables” with the “symadmin” command then creates the catalog tables that store and control the rules and direction of replication between nodes. Finally, the demo tables are loaded with sample data.

vm1$> cd /usr/local/symmetric-server-3.9.20/bin
vm1$> ./dbimport --engine corp-000 --format XML create_sample.xml
vm1$> ./symadmin --engine corp-000 create-sym-tables
vm1$> ./dbimport --engine corp-000 insert_sample.sql

The demo tables “item” and “item_selling_price” are auto-configured to replicate from corp-000 to store-001, while the sale tables (sale_transaction and sale_return_line_item) are auto-configured to replicate from store-001 to corp-000. The next step is to create the sample schema in the PostgreSQL database on host vm2 (store-001), in order to prepare it to receive data from corp-000.

vm2$> cd /usr/local/symmetric-server-3.9.20/bin
vm2$> ./dbimport --engine store-001 --format XML create_sample.xml

It is important at this stage to verify the existence of the demo tables and the SymmetricDS catalog tables in the MySQL database on vm1. Note that the SymmetricDS system tables (tables with the “sym_” prefix) are only available on the corp-000 node at this point, because that is where the “create-sym-tables” command was executed; this is where the replication will be controlled and managed. In addition, the store-001 node database will only have the 4 demo tables, with no data in them.
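Both points can be checked quickly; a minimal sketch, assuming the credentials and database names from the properties files above (the routing query also assumes the demo rules loaded by insert_sample.sql, and shows the replication direction configured for each table):

vm1$> mysql -uroot -p'admin123' -D replica_db -e "show tables like 'sym\_%'"
vm1$> mysql -uroot -p'admin123' -D replica_db -e "select tr.trigger_id, r.source_node_group_id, r.target_node_group_id from sym_trigger_router tr join sym_router r on tr.router_id = r.router_id"
vm2$> psql -p 5832 -U postgres pgdb_replica -c "\dt"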

The environment is now ready to start the “sym” server processes on both nodes, as shown below.

vm1$> cd /usr/local/symmetric-server-3.9.20/bin
vm1$> sym 2>&1 &

Log entries are sent both to a log file (symmetric.log) under the “logs” directory in the SymmetricDS install location and to standard output. The “sym” server can now be started on the store-001 node.

vm2$> cd /usr/local/symmetric-server-3.9.20/bin
vm2$> sym 2>&1 &
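To watch the synchronization activity on either node, one option is to tail the log file mentioned above; a sketch, assuming the install location used in this demo:

vm1$> tail -f /usr/local/symmetric-server-3.9.20/logs/symmetric.log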

Starting the “sym” server process on host vm2 also creates the SymmetricDS catalog tables in the PostgreSQL database. With the “sym” server process running on both nodes, they coordinate with each other to replicate data from corp-000 to store-001. After a few seconds, querying all four tables on either side will show successful replication results. Alternatively, an initial load can be sent to the store-001 node from corp-000 with the command below.

vm1$> ./symadmin --engine corp-000 reload-node 001

At this point, a new record is inserted into the “item” table in the MySQL database on the corp-000 node (host: vm1), and it can be verified to have replicated successfully to the PostgreSQL database on the store-001 node (host: vm2). This demonstrates the “Pull” event, moving data from corp-000 to store-001.

mysql> insert into item values ('22000002','Jelly Bean');
Query OK, 1 row affected (0.00 sec)
vm2$> psql -p 5832 -U postgres pgdb_replica -c "select * from item" 
 item_id  |   name
----------+-----------
 11000001 | Yummy Gum
 22000002 | Jelly Bean
(2 rows)

The “Push” event, moving data from store-001 to corp-000, can be demonstrated by inserting a record into the “sale_transaction” table and confirming that it replicates through.

pgdb_replica=# insert into "sale_transaction" ("tran_id", "store_id", "workstation", "day", "seq") values (1000, '001', '3', '2007-11-01', 100);
vm1$> mysql -uroot -p'admin123' -D replica_db -e "select * from sale_transaction";
+---------+----------+-------------+------------+-----+
| tran_id | store_id | workstation | day        | seq |
+---------+----------+-------------+------------+-----+
|     900 | 001      | 3           | 2012-12-01 |  90 |
|    1000 | 001      | 3           | 2007-11-01 | 100 |
|    2000 | 002      | 2           | 2007-11-01 | 200 |
+---------+----------+-------------+------------+-----+

This marks the successful configuration of bi-directional replication of the demo tables between a MySQL and a PostgreSQL database. Replication of newly created user tables can be configured using the following steps. An example table “t1” is created for the demo, and the rules of its replication are configured as per the procedure below. Note that these steps only configure replication from corp-000 to store-001.

mysql> create table  t1 (no integer);
Query OK, 0 rows affected (0.01 sec)
mysql> insert into sym_channel (channel_id,create_time,last_update_time) 
values ('t1',current_timestamp,current_timestamp);
Query OK, 1 row affected (0.01 sec)
mysql> insert into sym_trigger (trigger_id, source_table_name,channel_id,
last_update_time, create_time) values ('t1', 't1', 't1', current_timestamp,
current_timestamp);
Query OK, 1 row affected (0.01 sec)
mysql> insert into sym_trigger_router (trigger_id, router_id,
initial_load_order, create_time, last_update_time) values ('t1',
'corp-2-store-1', 1, current_timestamp, current_timestamp);
Query OK, 1 row affected (0.01 sec)

After this, the configuration is notified about the schema change (the newly added table) by invoking the symadmin command with the “sync-triggers” argument, which recreates the triggers to match the table definitions. Subsequently, “send-schema” is executed to send the schema change out to the store-001 node, after which replication of the “t1” table is configured successfully.

vm1$> ./symadmin -e corp-000 --node=001 sync-triggers    
vm1$> ./symadmin send-schema -e corp-000 --node=001 t1
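To confirm the new rule, a minimal check, assuming the same connections used earlier in the demo:

mysql> insert into t1 values (1);
vm2$> psql -p 5832 -U postgres pgdb_replica -c "select * from t1"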

Pros of Using SymmetricDS

  • Effortless installation and configuration, including a pre-configured set of parameter files for building either a 3-node or a 2-node setup
  • Cross-platform and database independent, running on servers, laptops and mobile devices
  • Replicates any database to any other database, whether on-premises, over a WAN or in the cloud
  • Scales from a couple of databases to several thousand databases, replicating data seamlessly
  • A commercial version of the software offers a GUI-driven management console with an excellent support package

Cons of Using SymmetricDS

  • Manual command line configuration may involve defining the rules and direction of replication via SQL statements that load the catalog tables, which can be inconvenient to manage
  • Setting up a large number of tables for replication is an exhausting effort unless some form of scripting is used to generate the SQL statements that define the rules and direction of replication (see the sketch after this list)
  • Plenty of logging information clutters the logfile, requiring periodic logfile maintenance so that it does not fill up the disk
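As an illustration of that scripting approach, here is a minimal sketch that generates the trigger and trigger-router insert statements for a list of tables; the table names are hypothetical, and the channel and router ids must match those already defined in your own configuration:

# Generate SymmetricDS rule statements for the listed tables
# (hypothetical table names; channel and router ids must match your configuration)
for tbl in t1 t2 t3; do
  echo "insert into sym_trigger (trigger_id, source_table_name, channel_id, create_time, last_update_time) values ('$tbl', '$tbl', 'default', current_timestamp, current_timestamp);"
  echo "insert into sym_trigger_router (trigger_id, router_id, initial_load_order, create_time, last_update_time) values ('$tbl', 'corp-2-store-1', 1, current_timestamp, current_timestamp);"
done > setup_replication.sql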

SymmetricDS Summary

SymmetricDS offers the ability to set up bi-directional replication between two nodes, three nodes, and so on up to several thousand nodes, replicating data and synchronizing files. It is a unique tool that performs many self-healing maintenance tasks, such as automatic recovery of data after extended periods of downtime on a node, secure and efficient communication between nodes over HTTPS, and automatic conflict management based on set rules. Its essential feature of replicating any database to any other database makes SymmetricDS ready to be deployed for a number of use cases, including migration, version and patch upgrades, and the distribution, filtering and transformation of data across diverse platforms.

The demo was created by referring to the official quick-start tutorial of SymmetricDS which can be accessed from here. The user guide can be found here, which provides a detailed account of various concepts involved in a SymmetricDS replication setup.
