Channel: NoSQL – Cloud Data Architect

Microsoft Azure for the Gaming Industry


Feed: Microsoft Azure Blog.
Author: Mujtaba Hamid.

This blog post was co-authored by Patrick Mendenall, Principal Program Manager, Azure. 

We are excited to join the Game Developers Conference (GDC) this week to learn what’s new and share our work in Azure focused on enabling modern, global games via cloud and cloud-native technologies.

Cloud computing is increasingly important for today's global gaming ecosystem, empowering developers of any size to reach gamers in any part of the world. Azure's 54 datacenter regions and its robust global network provide globally available, high-performance services, as well as a platform that is secure, reliable, and scalable to meet current and emerging infrastructure needs. For example, earlier this month we announced the availability of the Azure South Africa regions. Azure services support every phase of the game development lifecycle, from design and build through testing, publishing, monetization, measurement, engagement, and growth, providing:

  • Compute: Gaming services rely on a robust, reliable, and scalable compute platform. Azure customers can choose from a range of compute- and memory-optimized Linux and Windows VMs to run their workloads, services, and servers, including auto-scaling, microservices, and functions for modern, cloud-native games.
  • Data: The cloud is changing the way applications are designed, including how data is processed and stored. Azure provides high availability, global data, and analytics solutions based on both relational databases as well as big data solutions.
  • Networking: Azure operates one of the largest dedicated long-haul network infrastructures worldwide, with over 70,000 miles of fiber and sub-sea cable and more than 130 edge sites. Azure offers customizable networking options to allow for fast, scalable, and secure network connectivity between customer premises and global Azure regions.
  • Scalability: Azure offers nearly unlimited scalability. Given the cyclical usage patterns of many games, using Azure enables organizations to rapidly increase and/or decrease the number of cores needed, while only having to pay for the resources that are used.
  • Security: Azure offers a wide array of security tools and capabilities, to enable customers to secure their platform, maintain privacy and controls, meet compliance requirements (including GDPR), and ensure transparency.
  • Global presence: Azure has more regions globally than any other cloud provider, offering the scale needed to bring games and data closer to users around the world, preserving data residency, and providing comprehensive compliance and resiliency options for customers. Using Azure’s footprint, the cost, the time, and the complexity of operating a game at global scale can be reduced.
  • Open: With Azure you can use the software you choose, whether that means a particular operating system, game engine, database solution, or open source stack, and run it on Azure.

We’re also excited to bring PlayFab into the Azure family. Together, Azure and PlayFab are a powerful combination for game developers. Azure brings reliability, global scale, and enterprise-level security, while PlayFab provides Game Stack with managed game services, real-time analytics, and comprehensive LiveOps capabilities.

We look forward to meeting many of you at GDC 2019 to learn about your ideas in gaming, discussing where cloud and cloud-native technologies can enable your vision, and sharing more details on Azure for gaming. Join us at the conference or contact our gaming industry team at azuregaming@microsoft.com.

Details on all of these are available via links below.

  • Learn more about Microsoft Game Stack.
  • Talks at GDC:
  • Azure Gaming Reference Architectures: Landing Page
  • GDC Booth demos for Azure:
    • AI Training with Containers – Use Azure and Kubernetes to power Unity ML Agents
    • Game Telemetry – Build better game balance and design
    • Build NoSQL Data Platforms – Azure Cosmos DB: a globally distributed, massively scalable NoSQL database service
    • Cross Realms with SQL – Build powerful databases with Azure SQL

Microsoft’s Azure Cosmos DB is named a leader in the Forrester Wave: Big Data NoSQL


Feed: Microsoft Azure Blog.
Author: Rimma Nehme.

We’re excited to announce that Forrester has named Microsoft as a Leader in The Forrester Wave™: Big Data NoSQL, Q1 2019 based on their evaluation of Azure Cosmos DB. We believe Forrester’s findings validate the exceptional market momentum of Azure Cosmos DB and how happy our customers are with the product.

NoSQL platforms are on the rise

According to Forrester, “half of global data and analytics technology decision makers have either implemented or are implementing NoSQL platforms, taking advantage of the benefits of a flexible database that serves a broad range of use cases…While many organizations are complementing their relational databases with NoSQL, some have started to replace them to support improved performance, scale, and lower their database costs.”

Azure Cosmos DB has market momentum

Azure Cosmos DB is Microsoft's globally distributed, multi-model database service for mission-critical workloads. Azure Cosmos DB provides turnkey global distribution with unlimited endpoint scalability, elastic scaling of throughput (at multiple granularities, e.g., database, keyspace, table, and collection) and storage worldwide, single-digit millisecond latencies at the 99th percentile, five well-defined consistency models, and guaranteed high availability, all backed by industry-leading, comprehensive SLAs. Azure Cosmos DB automatically indexes all data without requiring developers to deal with schema or index management. It is a multi-model service that natively supports document, key-value, graph, and column-family data models. Born in the cloud, Azure Cosmos DB is carefully engineered with multitenancy and global distribution from the ground up. As a foundational service in Azure, it is ubiquitous, running in all public regions as well as DoD and sovereign clouds, with an industry-leading list of compliance certifications and enterprise-grade security, all without any extra cost.

Azure Cosmos DB's unique approach of providing wire protocol-compatible APIs for popular open-source databases ensures that you can continue to use Azure Cosmos DB in a cloud-agnostic manner while still leveraging a robust database platform natively designed for the cloud. You get the flexibility to run your Cassandra, Gremlin, and MongoDB apps fully managed with no vendor lock-in. While Azure Cosmos DB exposes APIs for these popular open-source databases, it does not rely on their implementations for realizing the semantics of the corresponding APIs.
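
As an illustration of that wire-protocol compatibility, here is a minimal sketch that talks to an Azure Cosmos DB account through its API for MongoDB using the standard open-source PyMongo driver. The connection string, database, and collection names are placeholders for your own account's values.

# Minimal sketch: Azure Cosmos DB via its MongoDB wire-protocol API and plain PyMongo.
# The connection string below is a placeholder for the one shown on the account's
# Connection String blade in the Azure portal.
from pymongo import MongoClient

COSMOS_CONNECTION_STRING = "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true"

client = MongoClient(COSMOS_CONNECTION_STRING)
db = client["game_telemetry"]      # maps to a Cosmos DB database
players = db["players"]            # maps to a Cosmos DB container

# Standard MongoDB operations work unchanged against the Cosmos DB endpoint.
players.insert_one({"playerId": "p-001", "region": "EU", "score": 4200})
print(players.find_one({"playerId": "p-001"}))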


According to the Forrester report, Azure Cosmos DB is starting to achieve strong traction and “Its simplified database with relaxed consistency levels and low-latency access makes it easier to develop globally distributed apps.” Forrester mentioned specifically that “Customer references like its resilience, low maintenance, cost effectiveness, high scalability, multi-model support, and faster time-to-value.”

Forrester notes Azure Cosmos DB's global availability across all Azure regions and how customers use it for operational apps, real-time analytics, streaming analytics, and Internet of Things (IoT) analytics. Azure Cosmos DB powers many worldwide enterprises and Microsoft services such as Xbox, Skype, Teams, Azure, Office 365, and LinkedIn.

To fulfill their vision, in addition to operational data processing, organizations using Azure Cosmos DB increasingly invest in artificial intelligence (AI) and machine learning (ML) running on top of globally distributed data in Azure Cosmos DB. Azure Cosmos DB enables customers to seamlessly build, deploy, and operate low-latency machine learning solutions on planet-scale data. The deep integration between Spark and Azure Cosmos DB enables the end-to-end ML workflow: managing, training, and running inference with machine learning models on top of multi-model, globally distributed data for time-series forecasting, deep learning, predictive analytics, fraud detection, and many other use cases.

Azure Cosmos DB’s commitment

We are committed to making Azure Cosmos DB the best globally distributed database for all businesses and modern applications. With Azure Cosmos DB, we believe that you will be able to write amazingly powerful, intelligent, modern apps and transform the world.

If you are using our service, please feel free to reach out to us at AskCosmosDB@microsoft.com any time. If you are not yet using Azure Cosmos DB, you can try Azure Cosmos DB for free today; no sign-up or credit card is required. If you need any help or have questions or feedback, please reach out to us any time. For the latest Azure Cosmos DB news and features, stay up to date by following us on Twitter #CosmosDB, @AzureCosmosDB. We look forward to seeing what you will build with Azure Cosmos DB!

Download the full Forrester report and learn more about Azure Cosmos DB.

How to Unlock the Full Power of Apache Cassandra™


Feed: Blog Post – Corporate – DataStax.
Author: Louise Westoby, Senior Director, Product Marketing.

Open source software has become a natural go-to as enterprises seek to scale efficiently and affordably. Drawn by the allure of frictionless acquisitions, vendorless negotiations, and the open source appeal to potential new talent, companies of all types and sizes now use open source technology to drive growth and innovation.

Likewise, Apache Cassandra™ adoption has increased significantly since Facebook open sourced the project in 2008. Designed for the modern world, Cassandra enables organizations to build transformative applications that have outgrown the capabilities of traditional relational databases. As a result, companies that move to Cassandra can increase employee productivity and adapt to unexpectedly high traffic without fear of performance degradation, as well as implement new, game-changing use cases such as providing more personalized customer experiences.

While Cassandra is a powerful NoSQL database, there can be some barriers to adoption as well as hidden costs associated with maintaining and supporting the open source software.

Many organizations that would like to use Cassandra find themselves in a tricky position because there aren't enough skilled NoSQL developers to keep pace with demand. One recent survey, for example, revealed that only 8% of respondents believe there's enough Cassandra-ready talent on the market. Other barriers to Cassandra uptake include rising maintenance costs, ad hoc support, and a lack of internal expertise, which can leave enterprises spending far more than they anticipated on a technology they adopted to cut costs.

The Easiest Way to Unlock the Full Power of Cassandra

To get the most out of Cassandra, many organizations hire and train in-house talent (or, more likely, hire in-house talent and have them train themselves).

But most organizations will only be able to unlock the full potential of Cassandra by adding significant cost, complexity, and administrative burden to the mix—and that’s only in a best-case scenario. Who knows how long it might take for a team to learn their way around Cassandra and start building effectively on top of it?

This is precisely why we created DataStax Distribution of Apache Cassandra, a production-ready implementation of the database that is 100% compatible with open source Cassandra and fully supported by DataStax. The solution enables companies to unlock the full power of Cassandra without having to incur the expenses, risk, or unnecessary administrative complexities associated with adopting open source technology.

DataStax Distribution of Apache Cassandra: Key Features

Here’s what you can expect to get right out of the box when you choose DataStax Distribution of Apache Cassandra:

1. Production-ready Cassandra

The software is ready for production right away thanks to a rigorous QA and testing process. Since hotfixes, bug escalations, and upgrades are included, time to resolution accelerates and maintenance costs decrease. Also—you get the added value of the Apache Kafka Connector, the DataStax Bulk Loader, and Docker images for development.

2. 100% open source compatibility

At DataStax, we love all things open source. In fact, we also contribute to several other open source projects, including Apache TinkerPop™.

To this end, we’re pleased to announce that DataStax Distribution of Apache Cassandra is 100% compatible with open source Cassandra from both a storage and application perspective. As a result, you get to use stable, production-ready software without getting locked into any one distribution.
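
Because the distribution speaks the same wire protocol, the standard open-source Python driver connects to it exactly as it would to any Apache Cassandra cluster. The sketch below assumes a locally running node; the keyspace and table names are illustrative only.

# Minimal sketch: the stock open-source Cassandra Python driver against a
# DataStax Distribution of Apache Cassandra node (here assumed to run locally).
from uuid import uuid4
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point(s) for your cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id uuid PRIMARY KEY, name text)")

session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (uuid4(), "Ada"))
for row in session.execute("SELECT id, name FROM demo.users"):
    print(row.id, row.name)

cluster.shutdown()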

3. Support from Cassandra experts

Instead of relying on a motley crew of systems integrators, in-house resources, and third-party vendors spread out across time zones, companies that use DataStax Distribution of Apache Cassandra can choose 24×7 or 8×5 support from the team behind the majority of commits to the Cassandra project. This makes it easier to meet SLAs while reducing support costs.

Also—in many cases companies already have both Cassandra and DataStax Enterprise installed, but still require support for Cassandra. DataStax Distribution of Apache Cassandra enables these companies to take advantage of one support framework across their entire Cassandra implementation without compromising on the benefits of the open source software.

4. Free educational tools and resources

Organizations that use DataStax Distribution of Apache Cassandra also benefit from DataStax’s robust educational resources that are freely available to customers through DataStax Academy.

Developers who need help operating, configuring, and using Cassandra will find a wealth of resources there designed to enable more effective Cassandra utilization. DataStax also provides a rich set of tools and hosts events—like DataStax Accelerate, the world’s premier Apache Cassandra conference—to help teams become even more productive with the NoSQL database.

There’s no sense in making it any harder than it has to be to use Cassandra productively. Instead of starting from scratch and managing everything in-house, many of today’s leading organizations are starting to move to DataStax Distribution of Apache Cassandra. That way, they get a powerful, production-ready Cassandra distribution with way less hassle and for a lot less money.

eBook: The Untold Story of Apache Cassandra™


Enforce Centralized Tag Compliance Using AWS Service Catalog, DynamoDB, Lambda, and CloudWatch Events


Feed: AWS Partner Network (APN) Blog.
Author: Sagar Khasnis.

By Sagar Khasnis, Partner Solutions Architect at AWS
By Awaneendra Tiwari, Cloud Architect at Brillio

Some of the customers we work with have a central database where they keep tag values, and they want to enforce tags at provisioning using the tag enforcement capability of AWS Service Catalog.

For example, a customer can have a central location where they keep all of their constantly-updated cost center information, and they use that information as tag values in their AWS Service Catalog portfolios.

In this post, we’ll show you how to implement tag options so the tag option library is automatically updated when new tags are added to Amazon DynamoDB. Using this solution ensures that all the AWS Service Catalog products launched by end users will be automatically tagged with a standard set of values defined by your organization.

Brillio is an AWS Partner Network (APN) Advanced Consulting Partner. Together with AWS, they have worked to restructure and reshape billion-dollar enterprise IT operations into modern and agile digital infrastructure operations.

For example, Brillio is one of five APN Partners featured in the recent AWS WorkLink launch that aims to help companies accelerate secure enterprise mobility.

AWS Services and Definitions

Below is a brief review of the AWS services you’ll need, along with a few specialty terms we’ll be using throughout the post that are required to understand this tag enforcement solution.

AWS Service Catalog enables organizations to create and manage catalogs of IT services that are approved for use on AWS. The following are key concepts relating to AWS Service Catalog:

  • A Service Catalog product is a blueprint for building your AWS resources that you want to make available for deployment on AWS, along with the configuration information. You create a product by importing an AWS CloudFormation template or AWS Marketplace AMI and copying the product to AWS Service Catalog. A product can belong to multiple portfolios. To learn more, see the documentation.
  • A portfolio is a collection of products together with the configuration information. You can use portfolios to manage user access to specific products, and you can grant portfolio access at the AWS Identity and Access Management (IAM) user, group, and role level. To learn more, see the documentation.
  • A provisioned product is a CloudFormation stack with the AWS resources that are created. When an end user launches a product, AWS Service Catalog provisions the product in the form of a CloudFormation stack. To learn more, see the documentation.
  • Constraints control the way users deploy a product. With launch constraints, you can specify a role that AWS Service Catalog can assume to launch a product from the portfolio. To learn more, see the documentation.
  • A TagOption is a key-value pair managed in AWS Service Catalog. It’s not an AWS tag but serves as a template for creating one based on the TagOption. These TagOptions are applied to provisioned products as AWS tags.

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.

AWS Lambda is a compute service that lets you run code without provisioning or managing servers.

Amazon CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules that you can set up in minutes, you can easily route each type of event to one or more targets.

Solution Overview

Here is an architecture diagram of the tag enforcement process:


Figure 1 – Tag enforcement architecture using AWS Service Catalog, Lambda, DynamoDB, and CloudWatch Events.

This sample solution will help you set up tag enforcement in your AWS environment and perform the following processes:

  • Sync newly-added tags in the DynamoDB table
    • When you add a new tag pair in DynamoDB, it will trigger the TagSync Lambda function and create the same tag pair in your TagOption library.
    • Additionally, the TagOption created in the previous step will be associated with all your existing portfolios.
  • Sync removal of tags from the DynamoDB table
    • When you remove a tag pair from DynamoDB, it will trigger the TagSync Lambda function, disassociate the corresponding TagOption from all the associated portfolios, and remove it from your TagOption library.
  • Sync updated tags in the DynamoDB table
    • When you update a tag pair in DynamoDB, it will trigger the TagSync Lambda function and update the corresponding TagOption in your TagOption library and in all the associated portfolios.
  • Apply all the TagOptions to new portfolios automatically
    • When you create a new AWS Service Catalog portfolio, a CloudWatch event will trigger the TagEnforcement Lambda function and associate all the TagOptions from your TagOption library with the newly-created portfolio.

This mechanism ensures that any tags added in DynamoDB are added to your AWS Service Catalog TagOption library and attached to existing portfolios. Additionally, any newly-created portfolios will automatically contain all the TagOptions in your TagOption library.
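
The full implementation lives in the aws-samples repository referenced below; as a rough, hypothetical sketch of the idea, a TagSync-style handler for the INSERT case might look like the following. The DynamoDB attribute names ("Key" and "Value") are assumptions about the table schema, and pagination of portfolios is omitted for brevity.

# Hypothetical sketch of a TagSync-style Lambda handler for DynamoDB Streams INSERT
# events. This is NOT the code from the aws-samples repository; attribute names and
# the single-page portfolio listing are simplifying assumptions.
import boto3

sc = boto3.client("servicecatalog")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # MODIFY/REMOVE would update or disassociate the TagOption instead
        image = record["dynamodb"]["NewImage"]
        key, value = image["Key"]["S"], image["Value"]["S"]

        # Create the TagOption in the AWS Service Catalog TagOption library...
        tag_option_id = sc.create_tag_option(Key=key, Value=value)["TagOptionDetail"]["Id"]

        # ...and associate it with every existing portfolio.
        for portfolio in sc.list_portfolios()["PortfolioDetails"]:
            sc.associate_tag_option_with_resource(
                ResourceId=portfolio["Id"], TagOptionId=tag_option_id
            )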

Setup

There are a few basic actions to take before allowing automatic tag enforcement for your AWS Service Catalog portfolios. The following steps require administrator access to AWS resources via the AWS Management Console.

Launch Stack

  • On the CloudFormation details screen, the following parameters will be listed. You can use the default values listed for these parameters and click Next:
    • BucketName: Name of the Amazon Simple Storage Service (Amazon S3) bucket containing the Lambda code for the two Lambda functions used in this sample: tagEnforcement.zip and tagSync.zip. You can use the default value listed.
    • TagSyncLambdaName: The filename of the tag sync Lambda function (which syncs AWS Service Catalog TagOptions with DynamoDB) in your Amazon S3 bucket. You can use the default value listed.
    • TagEnforcementLambdaName: The filename of the tag enforcement Lambda function in your Amazon S3 bucket. You can use the default value listed.
    • RoleName: Enter the name of the execution role created for the tag enforcement and sync Lambda functions. You can use the default value listed.
    • CloudWatchRuleName: Enter the name of the Amazon CloudWatch rule that triggers the TagEnforcement Lambda function on creation of a new AWS Service Catalog portfolio. You can use the default value listed.
  • Click Next for the Specify Details and Options pages, and then click on the checkbox to acknowledge creation of AWS Identity and Access Management (IAM) resources.
  • Finally, click Create to set up your tag enforcement infrastructure.

All of the artifacts for this solution are available in the aws-samples GitHub repository.

Test Your Solution

Follow these steps to ensure your tag enforcement automation is running accurately:

Step 1: Add tag values to DynamoDB

  • Go to the DynamoDB table created from the CloudFormation template in the initial setup section. In the items tab, add the following values:
    • KEY=Name, VALUE=Tag-Enforcement
    • KEY=Team, VALUE=Operations
    • KEY=Cost Center, VALUE=100
    • KEY=Cost Center, VALUE=200


Figure 2 – How your DynamoDB table should look after entering the tag values.
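
The same test items can also be loaded with a short boto3 script instead of the console. This is an illustrative sketch only; the table name and the "Key"/"Value" attribute names are assumptions, so check the resources created by the CloudFormation stack for the actual values.

# Illustrative sketch: seed the tag table from Python instead of the DynamoDB console.
# The table name and attribute names are assumptions about the stack's resources.
import boto3

table = boto3.resource("dynamodb").Table("TagEnforcementTable")  # hypothetical name

for key, value in [
    ("Name", "Tag-Enforcement"),
    ("Team", "Operations"),
    ("Cost Center", "100"),
    ("Cost Center", "200"),
]:
    table.put_item(Item={"Key": key, "Value": value})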

Step 2: Check the TagOption library for latest tags from DynamoDB

  • Step 1 will trigger the TagSync Lambda function, which copies the values from DynamoDB to the AWS Service Catalog TagOption library.
  • At this point, the TagOption library in your AWS Service Catalog should automatically contain the tagOptions from your DynamoDB table, as shown in the screen shot below.


Figure 3 – AWS Service Catalog tagOption library with your DynamoDB tags.

Step 3: Create an AWS Service Catalog portfolio

  • Create an AWS Service Catalog portfolio called ‘Demo-Portfolio’ in the Region where you set up the tag enforcement automation infrastructure.
  • This step triggers the CloudWatch event that invokes the tag enforcement Lambda function, which applies all the tags in your TagOption library to the new portfolio.


Figure 4 – The process to create a new portfolio in AWS Service Catalog.

Step 4: Add new tags to DynamoDB

  • Go to the DynamoDB table for your tags and add the following value to the table:
    • KEY=Operation, VALUE=Addition


Figure 5 – How your DynamoDB table would look after adding a new tag.

Step 5: Check the newly-created portfolio for additions in DynamoDB

  • Go back to the ‘Demo-Portfolio’ to check the tagOptions associated with it.
  • Your newly-created tag in DynamoDB will be associated with the AWS Service Catalog portfolio, as shown in the image below.


Figure 6 – AWS Service Catalog portfolio updated with newly added tags in DynamoDB.

Step 6: Launch a product from the AWS Service Catalog portfolio as an end user

  • We added a sample product named “Amazon Elastic Compute Cloud (EC2) Windows” and assigned it to an end user. You can assign IAM users, groups, or roles to a portfolio on the portfolio details screen in the previous step.
  • When the end user launches the EC2 product, it will be tagged with the TagOptions from the previous step.


Figure 7 – The tagOptions screen in AWS Service Catalog during product launch.

Notes

Please note the following considerations while using this sample solution:

  • It will not import existing TagOptions from AWS Service Catalog to DynamoDB.
  • It assumes all the TagOptions in your TagOption library will be created in DynamoDB and does not consider cases where TagOptions are manually created in AWS Service Catalog.
  • It will not interfere with manually-created TagOptions unless the same TagOptions are created in the DynamoDB table.

You may want to extend this solution to support portfolio-specific tags, which can be done by adding a portfolio ID column in DynamoDB and updating the logic in the TagSync and TagEnforcement Lambda functions.

If you have an idea on how this solution could be extended, we would love to hear your ideas at aws-sa-servicecatalog@amazon.com.

Conclusion

In this post, we have provided an example of how you can dynamically update your AWS Service Catalog tag option library using Amazon DynamoDB. We showed you how to deploy the solution architecture using a predefined CloudFormation template, as well as how to run a test scenario to ensure the DynamoDB tags were automatically synced with your AWS Service Catalog TagOptions library and portfolios.

Additionally, we showed possible extensions you could apply to this solution for your unique compliance use cases. This exercise aims to give you a head start on your tag compliance strategy and ensure that all the AWS Service Catalog products launched by your end users will be tagged with a standard set of values defined by your organization.

If you have questions about implementing the solution described in this post, please contact AWS Support.




Brillio – APN Partner Spotlight

Brillio is an APN Advanced Consulting Partner. They help customers re-imagine their businesses and competitive advantages, and then rapidly develop and deploy disruptive, industrial-grade digital solutions.

Contact Brillio | Practice Overview

*Already worked with Brillio? Rate this Partner

*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.

Azure.Source – Volume 75


Feed: Microsoft Azure Blog.
Author: Rob Caron.

Preview | Generally available | News & updates | GTC 2019 | Technical content | Azure shows | Events | Customers, partners, and industries

Now in preview

Windows Virtual Desktop now in public preview on Azure

The public preview of the Windows Virtual Desktop service is now available on Azure. Customers can now access the only service that delivers simplified management, multi-session Windows 10, optimizations for Office 365 ProPlus, and support for Windows Server Remote Desktop Services (RDS) desktops and apps. With Windows Virtual Desktop, you can deploy and scale your Windows desktops and apps on Azure in minutes, while enjoying built-in security and compliance. Access to Windows Virtual Desktop is available through applicable RDS and Windows Enterprise licenses.


Azure Data Studio: An Open Source GUI Editor for Postgres

Support for PostgreSQL in Azure Data Studio is now available in preview. Azure Data Studio is a cross-platform modern editor focused on data development. It's available for Linux, macOS, and Windows. We're also introducing a corresponding preview PostgreSQL extension in Visual Studio Code (VS Code). Both Azure Data Studio and Visual Studio Code are open source and extensible – two things that PostgreSQL itself is based on. If your primary use case is data, choose Azure Data Studio to manage multiple database connections, explore database object hierarchy, set up dashboards, and more.

Azure Container Registry virtual network and Firewall rules preview support

Azure Container Registry (ACR) now supports limiting public endpoint access. Customers can now limit registry access to within an Azure Virtual Network (VNet), as well as whitelist IP addresses and ranges for on-premises services. VNet and firewall rules are supported with virtual machines (VMs) and Azure Kubernetes Service (AKS), and are available for public preview in all 25 public cloud regions. General availability (GA) will follow based on usage and feedback.

Also available in preview

Now generally available

Azure Backup for SQL Server in Azure Virtual Machines now generally available

Azure Backup for SQL Server running in Azure Virtual Machines (VMs) is now generally available. It is an enterprise-scale, zero-infrastructure solution that eliminates the need to deploy and manage backup infrastructure while providing a simple and consistent experience to centrally manage and monitor backups on standalone SQL instances and Always On Availability Groups. Built into Azure, the solution combines the core cloud promises of simplicity, scalability, security, and cost effectiveness with inherent SQL backup capabilities, leveraged through native APIs, to yield high-fidelity backups and restores.


Also generally available

News and updates

Microsoft’s Azure Cosmos DB is named a leader in the Forrester Wave: Big Data NoSQL

Forrester has named Microsoft a Leader in The Forrester Wave™: Big Data NoSQL for the first quarter of 2019, based on its evaluation of Azure Cosmos DB, validating the product's exceptional market momentum and customer satisfaction. According to Forrester, “half of global data and analytics technology decision makers have either implemented or are implementing NoSQL platforms, taking advantage of the benefits of a flexible database that serves a broad range of use cases.” We are committed to making Azure Cosmos DB the best globally distributed database for all businesses and modern applications. With Azure Cosmos DB, we believe that you will be able to write amazingly powerful, intelligent, modern apps and transform the world.


March 2019 changes to Azure Monitor Availability Testing

Azure Monitor Availability Testing allows you to monitor the availability and responsiveness of any HTTP or HTTPS endpoint that is accessible from the public internet. At the end of this month we are deploying some major changes to improve performance and reliability, as well as to allow us to make more improvements for the future. This post highlights and describes some of the changes needed to ensure that your tests continue running without any interruption.

Data integration with ADLS Gen2 and Azure Data Explorer using Data Factory

Introducing the latest integration in Azure Data Factory. Azure Data Lake Storage Gen2 is a data lake platform that combines advanced data lake capabilities with the economics, global scale, and enterprise-grade security of Azure Blob Storage. Azure Data Explorer is a fully managed analytics service for real-time analysis of large volumes of data, and Azure Data Factory provides the fully managed data integration service to operationalize and manage ETL/ELT flows with flexible control flow, rich monitoring, and continuous integration and continuous delivery (CI/CD) capabilities. You can now meet the advanced needs of your analytics workloads with unmatched price performance and the security of one of the best clouds for analytics.

Additional news and updates

News from NVIDIA GPU Technology Conference

Over the years, Microsoft and NVIDIA have helped customers run demanding applications on GPUs in the cloud. Last week at NVIDIA GPU Technology Conference 2019 (GTC 2019) in San Jose, Microsoft made several announcements on our collaboration with NVIDIA to help developers and data scientists deliver innovation faster.

Microsoft and NVIDIA extend video analytics to the intelligent edge

Microsoft and NVIDIA are partnering on a new approach for intelligent video analytics at the edge to transform raw, high-bandwidth videos into lightweight telemetry. This delivers real-time performance and reduces compute costs for users. With this latest collaboration, NVIDIA DeepStream and Azure IoT Edge extend the AI-enhanced video analytics pipeline to where footage is captured, securely and at scale. Now, our customers can get the best of both worlds—accelerated video analytics at the edge with NVIDIA GPUs and secure connectivity and powerful device management with Azure IoT Edge and Azure IoT Hub.


Microsoft and NVIDIA bring GPU-accelerated machine learning to more developers

GPUs have become an indispensable tool for doing machine learning (ML) at scale. Our collaboration with NVIDIA marks another milestone in our venture to help developers and data scientists deliver innovation faster. We are committed to accelerating the productivity of all machine learning practitioners regardless of their choice of framework, tool, and application. Two of the integrations that Microsoft and NVIDIA have built together to unlock industry-leading GPU acceleration for more developers and data scientists are covered in the next two posts.

Azure Machine Learning service now supports NVIDIA’s RAPIDS

Azure Machine Learning service is the first major cloud ML service to support NVIDIA’s RAPIDS, a suite of software libraries for accelerating traditional machine learning pipelines with NVIDIA GPUs. With RAPIDS on Azure Machine Learning service, users can accelerate the entire machine learning pipeline, including data processing, training and inferencing, with GPUs from the NC_v3, NC_v2, ND or ND_v2 families. Azure Machine Learning service users are able to use RAPIDS in the same way they currently use other machine learning frameworks, and can use RAPIDS in conjunction with Pandas, Scikit-learn, PyTorch, TensorFlow, etc.
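
As a rough illustration of what RAPIDS code looks like (generic cuDF usage rather than anything Azure Machine Learning-specific), the GPU DataFrame library mirrors the pandas API; the CSV file and column names below are placeholders.

# Minimal sketch of RAPIDS cuDF: pandas-like operations executed on the GPU.
# Requires a RAPIDS-enabled environment with an NVIDIA GPU; file and columns are placeholders.
import cudf

gdf = cudf.read_csv("telemetry.csv")                      # loaded and parsed on the GPU
summary = gdf.groupby("device_id")["latency_ms"].mean()   # GPU-accelerated aggregation
print(summary.head())

pdf = summary.to_pandas()   # hand off to pandas/scikit-learn on the CPU when needed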

ONNX Runtime integration with NVIDIA TensorRT in preview

Announcing the open source preview of the NVIDIA TensorRT execution provider in ONNX Runtime. This takes another step towards open and interoperable AI by enabling developers to easily leverage industry-leading GPU acceleration regardless of their choice of framework. Developers can now tap into the power of TensorRT through ONNX Runtime to accelerate inferencing of ONNX models, which can be exported or converted from PyTorch, TensorFlow, and many other popular frameworks.
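
As a sketch of how this is consumed from Python (using the current onnxruntime API; the model path and input shape are placeholders, and a TensorRT-enabled build of ONNX Runtime is assumed), you list the TensorRT provider first and let the runtime fall back to CUDA or CPU where it is unavailable.

# Minimal sketch: run an ONNX model with the TensorRT execution provider when available.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder path to an exported/converted ONNX model
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input shape
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)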

Technical content

Reducing security alert fatigue using machine learning in Azure Sentinel

Alert fatigue is real. Security analysts face a huge burden of triage as they not only have to sift through a sea of alerts, but to also correlate alerts from different products manually. Machine learning (ML) in Azure Sentinel is built-in right from the beginning and focuses on reducing alert fatigue while offering ML toolkits tailored to the security community; including ML innovations aimed at making security analysts, security data scientists, and engineers productive.


Breaking the wall between data scientists and app developers with Azure DevOps

Data scientists are used to developing and training machine learning models for their favorite Python notebook or an integrated development environment (IDE). The app developer is focused on the application lifecycle – building, maintaining, and continuously updating the larger business application. As AI is infused into more business-critical applications, it is increasingly clear that we need to collaborate closely to build and deploy AI-powered applications more efficiently. Together, Azure Machine Learning and Azure DevOps enable data scientists and app developers to collaborate more efficiently while continuing to use the tools and languages that are already familiar and comfortable.

Step-By-Step: Getting Started with Azure Machine Learning

In this comprehensive guide, Anthony Bartolo explains how to set up a prediction model using Azure Machine Learning Studio. The example comes from a real-life hackfest with Toyota Canada and predicts the pricing of vehicles.

Intro to Azure Container Instances

Aaron Powell covers using Azure Container Instances to run containers with a really simple approach in this handy introductory guide. It walks through a hello world demo and then some advanced scenarios on using ACR and connecting to Azure resources (such as a SQL server).

Getting started with Azure Monitor Dynamic Thresholds

This overview from Sonia Cuff discusses the new Dynamic Thresholds capability in Azure Monitor, where machine learning sets the alert threshold levels when you are monitoring metrics (e.g., CPU percentage use in a VM or HTTP request time in an application).

How to Deploy a Static Website Into Azure Blob Storage with Azure DevOps Pipeline

In this third video and blog post of Frank Boucher's CI/CD series, he creates a release pipeline and explains how to deploy an ARM template to create or update Azure resources and deploy a static website into blob storage. This series explains how to build a continuous integration and continuous deployment system using Azure DevOps Pipelines to deploy a static website into Azure Blob Storage.


Fixing Azure Functions and Azure Storage Emulator 5.8 issue

If you happen to run into an error after updating Azure Functions to the latest version, Maxime Rouille’s hotfix is for you. He not only explains what to do, but why this might happen and what his hot fix does.

Using Key Vault with Your Mobile App the Right Way

You know you need to keep your app’s secrets safe and follow best practices – but how? In this post, Matt Soucoup uses a Xamarin app to walk through how and why to use Azure Key Vault, Active Directory, and Azure Functions to keep application secrets off of your mobile app and in Key Vault – without tons of extra work for you.

How do You Structure Your Code When Moving Your API from Express to Serverless Functions?

There are a lot of articles showing how to use serverless functions for a variety of purposes. A lot of them cover how to get started, and they are very useful. But what do you do when you want to organize them a bit more as you do for your Node.js Express APIs? There’s a lot to talk about on this topic, but in this post, John focuses specifically on one way you can organize your code.

Securely monitoring your Azure Database for PostgreSQL Query Store

Long-running queries may interfere with overall database performance and are likely stuck on some background process, which means that from time to time you need to investigate whether any queries are running indefinitely on your databases. See how you can set up alerting on query performance-related metrics using Azure Functions and Azure Key Vault.
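
For a sense of the kind of check such a monitoring function might run, the sketch below lists queries that have been executing for more than five minutes via pg_stat_activity. The connection string is a placeholder; the post itself wires this into Azure Functions and pulls credentials from Azure Key Vault rather than hard-coding them.

# Minimal sketch: find long-running queries on an Azure Database for PostgreSQL server.
# Connection details are placeholders; store the password in Key Vault, not in code.
import psycopg2

conn = psycopg2.connect(
    "host=myserver.postgres.database.azure.com dbname=mydb "
    "user=admin@myserver password=<from-key-vault> sslmode=require"
)

with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, state, query
        FROM pg_stat_activity
        WHERE state <> 'idle'
          AND now() - query_start > interval '5 minutes'
        ORDER BY runtime DESC
    """)
    for pid, runtime, state, query in cur.fetchall():
        print(f"pid={pid} ({state}) running for {runtime}: {query[:80]}")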

Expanded Jobs functionality in Azure IoT Central

We have improved device management workflow through additional jobs functionalities that make managing your devices at scale much easier. In this brief post, learn to copy an existing job, save a job to continue working on later, stop or resume a running job, and download a job details report once your job has completed running.


Azure Stack IaaS – part five

Self-service is core to Infrastructure-as-a-Service (IaaS). Azure’s IaaS gives the owner of the subscription everything they need to create virtual machines (VMs) and other resources on their own, without involving an administrator. This post shows a few examples of Azure and Azure Stack self-service management of VMs.

Additional technical content

Azure shows

Episode 271 – Azure Stack – Tales from the field | The Azure Podcast

Azure Stack experts from Microsoft Services, Heyko Oelrichs and Rathish Ravikumar, give us an update on Azure Stack and some valuable tips and tricks based on their real-world experiences deploying it for customers.



One Dev Question: What new HoloLens and Azure products were released in Barcelona? | One Dev Minute

In this episode of One Dev Question, Alex Kipman discusses Microsoft Mixed Reality, featuring announcements from Mobile World Congress.

Data Driven Styling with Azure Maps | Internet of Things Show

Ricky Brundritt, PM in the Azure Maps team, walks us through data driven styling with Azure Maps. Data driven styling allows you to dynamically style layers at render time on the GPU using properties on your data. This provides huge performance benefits and allows large datasets to be rendered on the map. Data driven style expressions can greatly reduce the amount of code you would normally need to write to define this type of business logic using if-statements and map event monitoring.

New Microsoft Azure HPC Goodness | Tuesdays with Corey

Corey Sanders and Evan Burness (Principal Program Manager on the Azure Compute team) sat down to talk about new things in the High Performance Computing space in Azure.

Get ready for Global Azure Bootcamp 2019 | Azure Friday

Global Azure Bootcamp is a free, one-day, local event that takes place globally. It’s an annual event run by the Azure community. This year, Global Azure Bootcamp is on Saturday, April 27, 2019. Event locations range from New Zealand to Hawaii and chances are good that you can find a location near you. You may even organize your own local event and receive sponsorship so long as you register by Friday, March 29, 2019. Join in to receive sponsorship, Azure passes, Azure for Students, lunch, and a set of content to use for the day.

How to use Azure Monitor Application Insights to record custom events | Azure Tips and Tricks

Learn how to use Azure Monitor Application Insights to make your application logging smarter with Custom Event Tracking.
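
As a minimal sketch of the same idea from Python (using the Application Insights SDK for Python; the instrumentation key, event name, and properties below are placeholders):

# Minimal sketch: record a custom event with the Application Insights Python SDK.
from applicationinsights import TelemetryClient

tc = TelemetryClient("<your-instrumentation-key>")   # placeholder key
tc.track_event(
    "CheckoutCompleted",                             # custom event name
    properties={"plan": "premium"},                  # string properties
    measurements={"cartValue": 129.95},              # numeric measurements
)
tc.flush()   # ensure buffered telemetry is sent before the process exits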


How to create an Azure Kubernetes Service cluster in the Azure Portal | Azure Portal Series

The Azure Portal enables you to get started quickly with Kubernetes and containers. In this video, learn how to easily create an Azure Kubernetes Service cluster.


Phil Haack on DevOps at GitHub | Azure DevOps Podcast

Jeffrey Palermo and Phil Haack take a deep dive into DevOps at GitHub. They talk about Phil's role as Director of Engineering; how GitHub, as a company, grew while Phil worked there; the inner workings of how the GitHub website ran; and details about how various protocols, continuous integration, automated testing, and deployment worked at GitHub.

Episode 3 – ExpressRoute, Networking & Hybridity, Oh My! | AzureABILITY

AzureABILITY host Louis Berman discusses ExpressRoute, networking and hybridity with Microsoft’s Bryan Woodworth, who specializes in networking, connectivity, high availability, and routing for hybrid workloads in Azure.



Events

Microsoft Azure for the Gaming Industry

Cloud computing is increasingly important for today’s global gaming ecosystem, empowering developers of any size to reach gamers in any part of the world. In this wrap-up post from GDC 2019, learn how Azure and PlayFab are a powerful combination for game developers. Azure brings reliability, global scale, and enterprise-level security, while PlayFab provides Game Stack with managed game services, real-time analytics, and comprehensive LiveOps capabilities.

The Value of IoT-Enabled Intelligent Manufacturing

Learn how you can apply insights from real-world use cases of IoT-enabled intelligent manufacturing when you attend the Manufacturing IoT webinar on March 28th: Go from Reaction to Prediction – IoT in Manufacturing. In addition, you’ll learn how you can use IoT solutions to move from a reactive to predictive model. For additional hands-on, actionable insights around intelligent edge and intelligent cloud IoT solutions, join us on April 19th for the Houston Solution Builder Conference.

Customers, partners, and industries

Power IoT and time-series workloads with TimescaleDB for Azure Database for PostgreSQL

Announcing a partnership with Timescale that introduces support for TimescaleDB on Azure Database for PostgreSQL for customers building IoT and time-series workloads. TimescaleDB allows you to scale for fast ingest and complex queries while natively supporting full SQL and has a proven track record of being deployed in production in a variety of industries including oil & gas, financial services, and manufacturing. The partnership reinforces our commitment to supporting the open-source community to provide our users with the most innovative technologies PostgreSQL has to offer.
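
As a small example of what this looks like in practice, the sketch below enables the extension and converts an ordinary table into a TimescaleDB hypertable. The server name and credentials are placeholders, and on Azure the timescaledb extension must first be allowed on the server (via the shared_preload_libraries server parameter) before CREATE EXTENSION will succeed.

# Minimal sketch: turn a regular PostgreSQL table into a TimescaleDB hypertable.
# Connection details are placeholders for an Azure Database for PostgreSQL server.
import psycopg2

conn = psycopg2.connect(
    "host=myserver.postgres.database.azure.com dbname=metrics "
    "user=admin@myserver password=<password> sslmode=require"
)
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS conditions (
            time        TIMESTAMPTZ NOT NULL,
            device_id   TEXT        NOT NULL,
            temperature DOUBLE PRECISION
        );
    """)
    # create_hypertable() is TimescaleDB's entry point for time-based partitioning.
    cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")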


This Week in Azure – 22 March 2019 | A Cloud Guru – Azure This Week

Lars is away this week and so Alex Mackey covers Azure Portal improvements, the Azure mobile app, Azure SQL Data Warehouse Workload Importance and Microsoft Game Stack.


Apache Cassandra™: Five Interesting Facts


Feed: Blog Post – Corporate – DataStax.
Author: Louise Westoby, Senior Director, Product Marketing.

Apache Cassandra is the NoSQL database that powers several of today’s leading organizations, including Instagram, GoDaddy, and Netflix.

The database was developed to meet the needs of today’s most demanding applications, which have outgrown the capabilities of traditional relational databases.

While much has been written about Cassandra over the years, there are still a few things about it that many people may not be aware of.

1. Cassandra started at Facebook

When Facebook began growing rapidly, no database on the market could meet the social media juggernaut’s performance and scalability requirements.

So in 2007, two Facebook engineers—Avinash Lakshman and Prashant Malik—began developing Cassandra to power the social network’s inbox search feature using large datasets across multiple servers.

In July 2008, Facebook released Cassandra as an open source project. And in January 2009, Cassandra became an Apache Incubator project.

2. Cassandra recently turned 10

In July 2018, Cassandra officially turned 10, and the database has enjoyed tremendous growth since its birth, with major enterprises across the globe now using it to build and deploy powerful applications in hybrid cloud environments.

3. Cassandra’s name came from Greek Mythology

Lakshman and Malik decided to name their project Cassandra for a very specific reason.

In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. According to legend, the god Apollo was in love with Cassandra. To win her heart, he gave her the ability to see the future, which she accepted. After receiving the gift, she told Apollo she didn’t want to be with him anymore. So Apollo cursed her: Cassandra would be able to see the future, but no one would believe any of her prophecies.

Cassandra, therefore, became known as the “cursed oracle.” We’ll let you draw your own conclusions on that one.

4. DataStax is behind the majority of Cassandra commits

Cassandra is an open source project, but DataStax has been the driving force behind it. The company is responsible for the majority of all commits to the project (i.e., changes accepted into the codebase). Read the full story about how this came to be here.

5. Apple has the biggest Cassandra instance

In an age where customers expect highly positive and personalized user experiences in each interaction and organizations are always trying to increase internal productivity, scalability has never been more important.

Simply put, Cassandra makes it easy to scale.

The database’s masterless architecture supports linear scalability for read and write operations; if you want to double performance, just double the number of nodes.
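
A small illustration of how that plays out in practice: replication is declared per keyspace, so the application simply states how many copies of the data each datacenter should keep, and adding nodes later requires no application changes. The contact points and datacenter name below are placeholders.

# Minimal sketch: declare three replicas per datacenter; scaling out is then a matter
# of adding nodes, not changing application code. Addresses and "dc1" are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")
cluster.shutdown()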

How scalable is Cassandra?

In 2014, Apple announced its Cassandra instance had more than 75,000 nodes and stored more than 10 petabytes of data. What’s more, a single cluster was over 1,000 nodes.

The company says that it processes millions of read and write operations per second on a regular basis.

That was five years ago. Who knows how big Apple’s instance has grown since then?

At DataStax, we’ve enjoyed watching Cassandra grow in adoption and sophistication over the last 10 years. But we’re even more excited to see how the powerful database management system grows over the next decade—particularly following the upcoming release of Cassandra 4.0.

Stay tuned! We’re just getting started.

eBook: The 5 Main Benefits of Apache Cassandra™


From enterprise to edge: embeddable databases unleash new capabilities


Feed: IBM Big Data & Analytics Hub – All Content.
Author: milind-tamaskar.

Delivering meaningful analytics in an efficient, reliable way that takes data of all types into account remains a challenge for many organizations. This is particularly true for localized analytics performed at the Internet of Things (IoT) edge. Embeddable, small-footprint databases are making this task easier by providing distributed edge analytics with the reliability and security organizations require.

Patrice Favenecc of Hilscher spoke about an example at Think 2019. Hilscher will deploy its netIOT gateway for L’Oréal, which was looking for better packaging automation. Using the embedded IBM Informix database, Hilscher will help L’Oréal utilize the industrial automation data within the edge IoT gateways on its shop floor, helping deliver insights faster. This will promote greater operational efficiency. Having more rapid insight into the production line will mean L’Oréal could better adapt to meet changing consumer demand with higher quality.

Embeddable databases are upgrading to make processes smoother and even more valuable. IBM announced Informix V14.10 at Think 2019, revealing a host of new capabilities to clients and partners in attendance. Now, after beta testing with more than 25 customers and partners, it is available to the public. IBM Informix's support for time-series and spatial data, its reliability with continuous availability, and its cost-effective administration have been augmented to address database challenges and opportunities from the core of the enterprise to the edge of the network.

Less time wasted

Even databases with powerful analytics should be easy to install and use. Otherwise, database administrators (DBAs) waste time on low-value, repetitive tasks instead of more significant projects. To this end, automation, simplified processes and an intuitive user interface are key, particularly when the database must be installed and used across many distributed locations or edge devices.

With its support on Docker, IBM Informix helps developers get started without manual installation procedures through its Innovator-C and Developer editions. Once installed, support for SQL Common Table Expression helps application developers improve the readability and maintenance of complex queries and write powerful recursive queries, helping achieve deeper insights with less effort.
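
As a hedged sketch of the kind of recursive query the new common table expression support enables (written in standard SQL form and executed here through a hypothetical Informix ODBC data source; consult the Informix 14.10 documentation for the exact dialect):

# Hypothetical sketch: a recursive CTE walking an org chart, run over an assumed
# Informix ODBC DSN. The DSN, credentials, and employees table are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=informix_dsn;UID=informix;PWD=<password>")
cur = conn.cursor()

cur.execute("""
    WITH RECURSIVE org_chart (employee_id, manager_id, depth) AS (
        SELECT employee_id, manager_id, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.employee_id, e.manager_id, o.depth + 1
        FROM employees e JOIN org_chart o ON e.manager_id = o.employee_id
    )
    SELECT employee_id, depth FROM org_chart ORDER BY depth
""")
for employee_id, depth in cur.fetchall():
    print(employee_id, depth)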

Monitoring databases and collaboration is also easier thanks to InformixHQ, which shows how key performance metrics change over time. It also tracks how efficiently IBM Informix is running workloads even when a user has stepped away from the screen. Customizable alerts can notify users of issues through email, Twilio, or PagerDuty. DBAs can focus on more important activities without fear of missing an important update.

Analytics, security and reliability

Whether a database is embedded across thousands of IoT devices or sits at the heart of a multibillion-dollar business, one thing is constant: powerful analytics delivered securely and reliably are essential. Having the wrong type of analytic capabilities or taking too long to process them could result in missed opportunities and lost revenue. Security breaches often result in significant delays and introduce other regulatory and consumer trust issues. Organizations can mitigate these potential issues from the start with analytic capabilities that match business needs and simplified or automated security and reliability functions that help organizations prepare for the worst with minimal effort.

IBM Informix V14.10 aims to aid in this approach with:

  • Analytics capabilities for a wide range of SQL, NoSQL, geodetic, and time-series data
  • Support for IoT edge gateways running the current 64-bit ARM V8 platform
  • Enhancements for time-series data, such as sub-second timestamps and a new function to find missing readings for a specific IoT sensor
  • Core and memory limit increases for the Workgroup edition of IBM Informix
  • Storage compression included in the Enterprise edition, capable of delivering higher performance and greater cost effectiveness through storage reduction
  • Log replay performance on remote secondary servers improved by up to a factor of five, which enables client applications to sustain near-zero latency between primary and secondary servers and allows for faster recovery in disaster scenarios

One beta client has already reported improved ability to scale and better cluster resiliency with IBM Informix V14.10. Uptime has also increased while administration remains simple, thanks to IBM Informix 14.10 allowing several in-place alter enhancements, renaming of indexes and constraints, loopback replication for table reorganizations and code set conversions, and the ability to identify client sessions through the use of labels.

Security can also receive a boost thanks to additional automated support. Remote storage of encryption-at-rest keys in Amazon Key Manager is now possible, reducing the DBA effort needed to encrypt data backups, and encrypting the encryption key itself provides even greater security. Transport Layer Security (TLS) support has also been raised to 1.2 for a higher level of network security.

Learn more about IBM Informix V14.10 and how it can help you take advantage of analytics opportunities within the enterprise or at the edge of your operations in our upcoming webinar.

Testing MySQL NDB Cluster with dbdeployer

$
0
0

Feed: Planet MySQL.
Author: Jesper Krogh.

A great way to install MySQL when you need to do quick tests is to use a sandbox tool. This allows you to perform all the installation steps with a single command making the whole process very simple, and it allows for automation of the test. Giuseppe Maxia (also known as the Data Charmer, @datacharmer on Twitter) has for many years maintained sandbox tools for MySQL, first with MySQL Sandbox and now with dbdeployer.

One of the most recent features of dbdeployer is the support for MySQL NDB Cluster. In this blog, I will take this feature and test it. First, I will briefly discuss what MySQL NDB Cluster is, then install dbdeployer, and finally set up a test cluster.

Deploying a MySQL NDB Cluster cluster with dbdeployer.

What is MySQL NDB Cluster?

MySQL NDB Cluster is primarily an in-memory database (though it also supports on-disk data) that has been designed from day one to be highly available and to provide consistent response times. A cluster consists of several nodes, each of which is one of three types:

  • Data Nodes: This is where the actual data is stored. Currently there is support for up to 48 data nodes in a cluster with up to 1TiB of data memory for each node.
  • API Nodes: These are the nodes where queries are executed. An API node can be a normal mysqld process (also known as an SQL node), or it can be a NoSQL node using the C++ (the native NDB API), Java (ClusterJ), memcached, or Node.js API.
  • Management Nodes: These nodes hold the configuration of the cluster, and one of the management nodes is the most common choice as an arbitrator in case it becomes necessary to decide between two halves of the data nodes to avoid a split-brain scenario.

You will typically have at least two data nodes in a cluster with two copies (replicas) of the data. This allows one data node to be offline while the cluster stays online. More data nodes can be added to increase the capacity or to add more data partitions. The data partitioning (sharding) and the replicas is all handled automatically, including when querying the data.

Overview of the MySQL NDB Cluster architecture.

All of this means that you will end up with quite a few nodes. In a production cluster, you need at least two of each node type to have high availability. Even though you may not need high availability for your testing, you will still need at least two data nodes, one management node, and one SQL node. Being able to automate the installation of the cluster is a great help when you need to do a quick test – which brings us to dbdeployer. The first step is to install it.

Want to Know More about MySQL NDB Cluster?

I am one of the authors of Pro MySQL NDB Cluster (Apress), an almost 700-page book dedicated to MySQL NDB Cluster. You can buy it from Apress (print or DRM-free ePub and PDF), Amazon (print and Kindle/Mobi), Barnes & Noble (print), and other book shops.

I have also written a brief introduction to MySQL NDB Cluster – but with a little more information than above – on Apress’ blog.

Installing dbdeployer

It is simple to install dbdeployer. On dbdeployer's GitHub page, there are releases that can be downloaded and easily installed. For this blog, I am using release 1.24.0 on Linux. I recommend that you use the latest release. In addition to Linux, dbdeployer is also available for macOS. Unfortunately, there is no Microsoft Windows support.

An example of downloading and installing dbdeployer is:

shell$ mkdir Downloads

shell$ cd Downloads/

shell$ wget https://github.com/datacharmer/dbdeployer/releases/download/v1.24.0/dbdeployer-1.24.0.linux.tar.gz
...
HTTP request sent, awaiting response... 200 OK
Length: 4888282 (4.7M) [application/octet-stream]
Saving to: ‘dbdeployer-1.24.0.linux.tar.gz’

100%[================================>] 4,888,282   1.70MB/s   in 2.8s   

2019-03-25 17:48:54 (1.70 MB/s) - ‘dbdeployer-1.24.0.linux.tar.gz’ saved [4888282/4888282]

shell$ tar -zxf dbdeployer-1.24.0.linux.tar.gz 

shell$ mkdir ~/bin

shell$ mv dbdeployer-1.24.0.linux ~/bin/dbdeployer

shell$ export PATH=${PATH}:~/bin

This downloads and unpacks the 1.24.0 release into the Downloads directory. Then the dbdeployer binary is moved to the ~/bin directory and renamed to dbdeployer. Finally, the ~/bin directory is added to the path searched when executing a command, so it is not necessary to specify the full path each time dbdeployer is executed. There are other ways to perform these steps and other options for where to install it; see also the official documentation.
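
To make the PATH change persist across shell sessions (a minimal sketch, assuming a Bash shell and that the binary lives in ~/bin as above), you can append the export to your ~/.bashrc:

shell$ echo 'export PATH=${PATH}:~/bin' >> ~/.bashrc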

That is it. Now it is possible to install a test cluster.

Installing a Test Cluster

Since dbdeployer works on a single host, all of the nodes will be installed on the same host. While this is bad for a production cluster, it is perfectly fine for most test clusters.

Warning

While a single host cluster is great for most tests, for testing your application before a deployment to production, it is recommended to use a multi-host cluster that is as similar to your production cluster as possible.

The first step is to download MySQL NDB Cluster as a tar-ball. You can get the latest patch release of each version from https://dev.mysql.com/downloads/cluster/. If you need to test with an older release, you can get that from https://downloads.mysql.com/archives/cluster/. In this example, MySQL NDB Cluster 7.6.9 is downloaded from the latest releases and placed in the ~/Downloads directory:

shell$ cd ~/Downloads/

shell$ wget https://dev.mysql.com/get/Downloads/MySQL-Cluster-7.6/mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz
...
HTTP request sent, awaiting response... 200 OK
Length: 914236523 (872M) [application/x-tar-gz]
Saving to: ‘mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz’

100%[================================>] 914,236,523  699KB/s   in 23m 52s

2019-03-25 18:49:29 (624 KB/s) - ‘mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz’ saved [914236523/914236523]

Once the download has completed, use the unpack command of dbdeployer to unpack the downloaded file:

shell$ dbdeployer unpack --prefix=ndb ~/Downloads/mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz 
Unpacking tarball /home/dbdeployer/Downloads/mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz to $HOME/opt/mysql/ndb7.6.9
.........100.........200.........300.........400.........500........
...
.........20300.........20400.........20500.........20600.........2070020704
Renaming directory /home/dbdeployer/opt/mysql/mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64 to /home/dbdeployer/opt/mysql/ndb7.6.9

You are now ready for the actual creation of the test cluster. This is done using the deploy command:

shell$ dbdeployer deploy replication ndb7.6.9 --topology=ndb --concurrent
$HOME/sandboxes/ndb_msb_ndb7_6_9/initialize_nodes
MySQL Cluster Management Server mysql-5.7.25 ndb-7.6.9
2019-03-27 17:22:16 [ndbd] INFO     -- Angel connected to 'localhost:20900'
2019-03-27 17:22:16 [ndbd] INFO     -- Angel allocated nodeid: 2
2019-03-27 17:22:17 [ndbd] INFO     -- Angel connected to 'localhost:20900'
2019-03-27 17:22:17 [ndbd] INFO     -- Angel allocated nodeid: 3
executing 'start' on node 1
............ sandbox server started
executing 'start' on node 2
.... sandbox server started
executing 'start' on node 3
.... sandbox server started
NDB cluster directory installed in $HOME/sandboxes/ndb_msb_ndb7_6_9
run 'dbdeployer usage multiple' for basic instructions'

This creates a cluster with two data nodes, one management node, and three SQL nodes. The nodes have been installed in the ${HOME}/sandboxes/ndb_msb_ndb7_6_9/ directory:

shell$ ls sandboxes/ndb_msb_ndb7_6_9/
check_nodes          ndb_conf  node3               test_replication
clear_all            ndb_mgm   restart_all         test_sb_all
cluster_initialized  ndbnode1  sbdescription.json  use_all
initialize_nodes     ndbnode2  send_kill_all       use_all_masters
n1                   ndbnode3  start_all           use_all_slaves
n2                   node1     status_all
n3                   node2     stop_all

Notice how there is, for example, an ndb_mgm script. This is a wrapper script around the ndb_mgm binary in the MySQL installation – the MySQL NDB Cluster management client. This makes it easy to connect to the management node, for example to check the status of the cluster:

shell$ ./sandboxes/ndb_msb_ndb7_6_9/ndb_mgm -e "SHOW"
Connected to Management Server at: localhost:20900
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @127.0.0.1  (mysql-5.7.25 ndb-7.6.9, Nodegroup: 0, *)
id=3    @127.0.0.1  (mysql-5.7.25 ndb-7.6.9, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @127.0.0.1  (mysql-5.7.25 ndb-7.6.9)

[mysqld(API)]   4 node(s)
id=4    @127.0.0.1  (mysql-5.7.25 ndb-7.6.9)
id=5    @127.0.0.1  (mysql-5.7.25 ndb-7.6.9)
id=6    @127.0.0.1  (mysql-5.7.25 ndb-7.6.9)
id=7 (not connected, accepting connect from localhost)

Before wrapping up, let's see how you can connect to the different SQL nodes and verify that they indeed query the same data.

Testing the Cluster

As a simple test, connect to the first SQL node and create a table. Then, connect to the second SQL node and insert a row. Finally, connect to the third SQL node and query the data.

The SQL nodes are in the node* directories in ${HOME}/sandboxes/ndb_msb_ndb7_6_9/. Each of those works in the same way as a standalone MySQL Server sandbox, so you can use the use wrapper script to connect using the MySQL command-line client:

shell$ ./sandboxes/ndb_msb_ndb7_6_9/node1/use 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 7
Server version: 5.7.25-ndb-7.6.9-cluster-gpl-log MySQL Cluster Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

node1 [localhost:27510] {msandbox} ((none)) > 

Now, the table can be created (output has been reformatted):

node1 [localhost:27510] {msandbox} ((none)) > \R
Returning to default PROMPT of mysql> 

mysql> CREATE SCHEMA db1;
Query OK, 1 row affected (0.42 sec)

mysql> CREATE TABLE db1.t1 (
          id int unsigned NOT NULL auto_increment PRIMARY KEY,
          val varchar(36)
       ) ENGINE=NDBCluster;
Query OK, 0 rows affected (2.64 sec)

I changed the prompt back to the default mysql> prompt. This is not because I don’t like the prompt created by dbdeployer, but simply to make the formatting of the queries nicer. In general, I do prefer my prompt to tell me where I am connected, so the normal dbdeployer prompt will otherwise work well for me.

The table creation is just like normal except the engine is set to NDBCluster. This is the engine name that tells MySQL to create the table in the data nodes.

The second step is to connect to the second instance and insert a row:

node2 [localhost:27511] {msandbox} ((none)) > INSERT INTO db1.t1 (val) VALUES (UUID());
Query OK, 1 row affected (0.11 sec)

node2 [localhost:27511] {msandbox} ((none)) > SELECT * FROM db1.t1;
+----+--------------------------------------+
| id | val                                  |
+----+--------------------------------------+
|  1 | 84f59369-5051-11e9-9078-08002709eea3 |
+----+--------------------------------------+
1 row in set (0.05 sec)

Notice how this worked without having to create the table on this node first. Since the table was created in the data nodes, all SQL nodes that connect to these data nodes automatically know about the table.

Finally, confirm the data is also available in the third node:

node3 [localhost:27512] {msandbox} ((none)) > SELECT * FROM db1.t1;
+----+--------------------------------------+
| id | val                                  |
+----+--------------------------------------+
|  1 | 84f59369-5051-11e9-9078-08002709eea3 |
+----+--------------------------------------+
1 row in set (0.12 sec)

Verdict

It is fantastic that dbdeployer now supports MySQL NDB Cluster as well. It will be a great help when performing tests. I do have some comments based on my testing. It is very likely that some of them are simply due to this being my first use of dbdeployer, and I will not claim to understand all the details of how it works yet, so do not take the following comments as the final word; nor are they meant as negative criticism:

  • I find it a little confusing that a cluster is considered a replication topology. Yes, there is synchronous replication between the data nodes, but it is not related to the replication you have between two MySQL Server instances (which is also supported between two clusters). Personally, I would have treated a single cluster as a single sandbox, and then allowed for a (future) feature that sets up two clusters with replication between them.
  • The restart_all sandbox command literally shuts down the whole cluster, then starts it again (but see also two items later). For MySQL NDB Cluster there are essentially two different types of restarts (which each can either be a normal or an initial restart):
    • System Restart: At a minimum, all data nodes are shut down together, then started together. This is what restart_all implements.
    • Rolling Restart: The cluster as a whole remains online throughout the restart phase. This is done by always leaving one data node in each node group online while restarting the data nodes. SQL nodes are restarted such that at least one SQL node is online at all times. This is the normal way to do most configuration changes as it avoids downtime. I miss this restart type.
  • There does not seem to be any way to choose between normal and initial restarts.
  • The start_all script does not start the management and data nodes (only the SQL nodes are started). This may be on purpose, but it seems inconsistent with stop_all, which does shut down the management and data nodes. Actually, I have not been able to find a way to start the cluster cleanly. There is initialize_nodes, which will start the management and data nodes, but that script will also try to start the SQL nodes and load grants into the SQL nodes.
  • The stop_all script first shuts down the management and data nodes, then the SQL nodes. It is better to do it in the opposite order, as that avoids errors on the SQL nodes if queries are executed during the shutdown. In older versions of MySQL NDB Cluster, it could also take a long time to shut down an SQL node that had lost the connection to the data nodes.
  • The management node is given NodeId = 1 and the data nodes the subsequent ids. Data nodes can only have ids 1-48, so I always recommend reserving these ids for data nodes, making the first management node NodeId = 49 and giving SQL nodes later ids (see the configuration sketch after this list).
  • There does not seem to be any way to change the number of management nodes. The --ndb-nodes option appears to be interpreted as one management node plus data nodes for the rest. Maybe a better way would be to have two options like:
    • --ndb-nodegroups: The number of node groups in the cluster. The number of data nodes can then be calculated as <# Node Groups> * NoOfReplicas.
    • --ndb-mgmnodes: The number of management nodes.
  • There is no check whether the number of NDB nodes is valid. For example, with --ndb-nodes=4, dbdeployer tries to create a cluster with three data nodes, which is not valid with NoOfReplicas = 2.
  • I did not find any way to specify my preferred configuration of the cluster as part of the sandbox deployment.
  • Consider adding the --reload option when starting ndb_mgmd (the management node). This makes the management node check whether there are any changes to the cluster configuration (stored in /ndb_conf/config.ini) and, if so, apply those changes (see the sketch after this list).
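
To illustrate the NodeId and --reload points above, here is a hypothetical config.ini sketch (the ids, host names, and paths are illustrative only and are not what dbdeployer generates):

# config.ini -- example layout reserving the low node ids for data nodes
[ndbd default]
NoOfReplicas=2

[ndbd]
NodeId=1
HostName=localhost

[ndbd]
NodeId=2
HostName=localhost

[ndb_mgmd]
NodeId=49
HostName=localhost

[mysqld]
NodeId=50

[mysqld]
NodeId=51

With such a file, the management node could be started with the --reload option so that configuration changes are picked up when it restarts:

shell$ ndb_mgmd -f /path/to/config.ini --configdir=/path/to/configdir --reload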

This may seem like a long list of comments, but I am also very well aware that support for MySQL NDB Cluster has only just been added, and that it takes time to implement all the details. Rome was not built in a day.

So, I would very much like to conclude with a big thank you to the Data Charmer. This is a great initial implementation.


How to use Redis with Kitura, a server-side Swift web framework

Feed: Redis Labs.
Author: Shabih Syed.

During a recent project, I needed to develop application services using varied technology stacks. One of my requirements was to pick a different programming language for each microservice in my application. While Java, Node and Python were easy choices, I wanted to try something new and obscure. During a conversation with my brother, who happens to be an active iOS developer, I learned for the first time about Swift and Kitura and decided to give it a try.

Whenever I am experimenting with a new language, I try to use it with a database. In this example, I will show how easy it is to use Redis as a data store for Swift-based microservices.

I’ll begin with a brief description of my technology stack, and then walk you through the steps to build an application with these tools:

Swift is a general purpose, multi-paradigm, compiled programming language developed by Apple Inc. for iOS, macOS, watchOS, tvOS, Linux and z/OS.

Kitura is a free open source web framework written in Swift, developed by IBM and licensed under Apache 2.0. It’s an HTTP server and web framework for writing Swift server applications.

Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. With almost 1.5 billion docker pulls, it is one of the most popular NoSQL databases.

Step 1: Installing Swift

First, download and install the latest version of Xcode from the App Store.

Next, check if Swift is installed. In the terminal, type the 'swift' command and pass the '--version' flag:
swift --version

Apple Swift version 4.2.1 (swiftlang-1000.0.42 clang-1000.10.45.1)
Target: x86_64-apple-darwin17.7.0

If Swift is not installed, then run the following --install command:

xcode-select --install

Step 2: Installing Kitura

Installing the Kitura web server is pretty easy as well.

1. Create a new directory to host your project.
mkdir MyKituraApp && cd MyKituraApp

2. Initialize the new directory as a Swift project.
swift package init --type executable

Creating executable package: MyKituraApp
Creating Package.swift
Creating README.md
Creating .gitignore
Creating Sources/
Creating Sources/MyKituraApp/main.swift
Creating Tests/
Creating Tests/LinuxMain.swift
Creating Tests/MyKituraAppTests/
Creating Tests/MyKituraAppTests/MyKituraAppTests.swift
Creating Tests/MyKituraAppTests/XCTestManifests.swift

3. To add Kitura to your dependencies, edit `Package.swift`.
Open `Package.swift` and edit it so it has the following text:

// swift-tools-version:4.2

import PackageDescription

let package = Package(
    name: "MyKituraApp",
    dependencies: [
        .package(url: "https://github.com/IBM-Swift/Kitura", from: "2.6.0")
    ],
    targets: [
        .target(
            name: "MyKituraApp",
            dependencies: ["Kitura"]),
        .testTarget(
            name: "MyKituraAppTests",
            dependencies: ["MyKituraApp"]),
    ]
)

4. Build the project to pull down your new dependency:
swift build

Go into the Sources folder and edit `main.swift` so it has the following text, which will initiate the Kitura web server.

import Kitura

let router = Router()
Kitura.addHTTPServer(onPort: 8080, with: router)
Kitura.run()

5. Since we’ve added code to `main.swift`, you’ll need to recompile the project:
swift build

6. Now you are ready to run your Swift app.
swift run

7. Navigate to http://localhost:8080 in your browser to confirm that the Kitura server is running.

Step 3: Get Redis

I use Redis Cloud, a fully managed Redis database-as-a-service, in this example. Creating a Redis instance with Redis Cloud is easy and free, and there are other options available as well (feel free to explore them in Get Started with Redis).

1. Visit the Redis Labs Get Started page, and click SIGN UP under the “Cloud Hosted” section.

2. Login to create your subscription and select a free (30MB) Redis database.

3. Name your database and activate it.


4. Take note of your database endpoint and password.

In this example, redis-15878.c91.us-east-1-3.ec2.cloud.redislabs.com is the URL of your Redis database and 15878 is the port.

Step 4: Connecting with Redis using the Kitura-Redis client

Kitura-Redis is a pure Swift client for interacting with a Redis database.

1. To add Kitura-Redis to your dependencies, you’ll need to edit `Package.swift` again.
Open `Package.swift` and edit it so it now has the following text:

// swift-tools-version:4.2

import PackageDescription

let package = Package(
    name: "MyKituraApp",
    dependencies: [
        .package(url: "https://github.com/IBM-Swift/Kitura", from: "2.6.0"),
        .package(url: "https://github.com/IBM-Swift/Kitura-redis.git", from: "2.1.0")
    ],
    targets: [
        .target(
            name: "MyKituraApp",
            dependencies: ["Kitura","SwiftRedis"]),
        .testTarget(
            name: "MyKituraAppTests",
            dependencies: ["MyKituraApp"]),
    ]
)

2. Now you’ll use Kitura-Redis to establish a connection with your Redis database* in the cloud and set a key called “Redis” with the value “on Swift”.

This is a simple example, but you can, of course, use Redis in more complex ways.

Go into your Sources folder and edit `main.swift` so it has the following text:

* Make sure to update the host, port, and password values below with your own Redis Cloud database configuration.

import Kitura
import Foundation
import SwiftRedis

let router = Router()
Kitura.addHTTPServer(onPort: 8080, with: router)

let redis = Redis()

//Establish connection with Redis

// Replace host and port with your own Redis Cloud endpoint and port
redis.connect(host: "redis-15878.c91.us-east-1-3.ec2.cloud.redislabs.com", port: 15878) { (redisError: NSError?) in

    // Check whether the connection failed (for example, an incorrect host or port)
    if let error = redisError {
        print(error)
    }

    // If the connection to Redis is successful, then authenticate with the password

    else {

        let password = "<password>"  // replace with your Redis Cloud database password
	    redis.auth(password) { (pwdError: NSError?) in
	    	if let errorPwd = pwdError {
	        	print(errorPwd)
	    	}
	    	else {
	    		print("Password Authentication Is Successful")
	    	}
	    }

        print("Established Connection to Redis")

        // Set a key
        redis.set("Redis", value: "on Swift") { (result: Bool, redisError: NSError?) in
            if let error = redisError {
                print(error)
            }
            // Get the same key
            redis.get("Redis") { (string: RedisString?, redisError: NSError?) in
                if let error = redisError {
                    print(error)
                }
                else if let string = string?.asString {
                    print("Redis (string)")
                }
            }
        }
    }
}
Kitura.run()

3. Since you’ve added code to `main.swift`, you’ll need to recompile the project:
swift build

4. Now you are ready to run your Swift app and will see the following:
swift run

Redis Password Authenticated
Connected to Redis
Redis on Swift

5. Using the Redis CLI, you can check whether the key you set in ‘main.swift’ was stored successfully.
redis-15878.c91.us-east-1-3.ec2.cloud.redislabs.com:15878> Keys *
1) Redis
redis-15878.c91.us-east-1-3.ec2.cloud.redislabs.com:15878> GET Redis
on Swift

For more information, visit the following pages:

Swift
Kitura
IBM Swift/Kitura
IBM Swift/Kitura-Redis

Happy Swifting with Redis!

So you want to deploy multiple containers running different R models?

Feed: R-bloggers.
Author: Martin Hanewald.

This tutorial is the second part of a series on professional R deployment. Please find the previous part here (How to make a dockerized plumber API secure with SSL and Basic Authentication).

If you followed the first part in this tutorial series, you have achieved the following things:

  • running your R code with a plumber API inside a docker container
  • activating HTTP Basic authentication and SSL encryption via the NGINX reverse proxy

For a single container deployment the approach we have seen is absolutely feasible, but once you repeat this process you will notice the redundancy. Why should we install and configure NGINX for every single container instance we are running in our environment? The logical thing to do is to configure this common task of authentication and encryption only once for all deployed containers. This can be achieved by utilizing docker swarm (https://docs.docker.com/engine/swarm/).

Docker swarm is able to run a collection of containers simultaneously such that they can communicate with each other over a shared virtual network. Docker swarm has a multitude of features which make it a powerful tool even in large-scale deployments, but in this tutorial we will apply a minimal setup to provide secured access for two or more containers running plumber APIs. To achieve this we will make use of the AHUB deployment framework (AHUB on Github). AHUB provides a pre-configured swarm setup to deploy analytical containers (based on R or any other language) with minimal effort.

The setup we want to achieve is the following:

Multiple containers running plumber APIs will be accessed via a single NGINX instance. The principle of AHUB is to provide a service stack for common tasks like access control and encryption, plus a GUI for manual API execution. The node stack contains multiple containers, each dedicated to a different analytical task. In this example we will run three node stack containers:

  • qunis/ahub_rnode: A minimal R plumber configuration with the endpoints /thread and /batch showcasing the use of the ahubr package
  • qunis/plumberdemo: A minimal R plumber configuration with the endpoints /echo, /plot and /sum (see the basic example on https://www.rplumber.io/)
  • qunis/prophetdemo: A more elaborate container running a timeseries forecast with the fantastic prophet library and producing an interactive dyplot 

Let's start!

First you need to clone the content of the AHUB repository on Github to your local machine (AHUB on GitHub) and open a Bash or Powershell session in the cloned folder.

Generating certificates and user credentials

AHUB comes with a pre-generated certificate and password file. But of course you want to change these. This is very quickly done, with two little helper containers. All you need to do is navigate to the subfolder ./configs and run the following commands (please fill in your username and password). This will create a new SSL certificate and key along with a .htpasswd file containing the MD5 hashed credentials for your user in the subfolder ./configs.

docker run --mount type=bind,src=$PWD,dst=/var qunis/openssl
docker run --mount type=bind,src=$PWD,dst=/var qunis/htpasswd username password

Configuring the stack

Docker swarm operates with a recipe telling it which containers to spin up, which ports to publish, which volumes to mount, et cetera. Everything you would normally configure in a single "docker run …" statement for a single container instance, we instead write down in the so-called Compose file when working with docker swarm. For a more detailed introduction see here.

Please inspect the demo file in the main folder

version: '3.3'
services:

# -------------------------------------------
# NODE STACK (add analytical modules here)
# -------------------------------------------
# For compatibility with AHUB, container images
# need to comply with the following:
#   - publish a REST API on port 8000
#   - provide a swagger.json file in the "/" path (only for GUI)
# -------------------------------------------

  node1:
    image: qunis/ahub_rnode:1.0

# -------------------------------------------
  
  node2:
    image: qunis/plumberdemo

# -------------------------------------------
  
  node3:
    image: qunis/prophetdemo
    
# -------------------------------------------
# SERVICE STACK (DO NOT TOUCH)
# -------------------------------------------

  nginx:
    image: nginx
    ports:
      - "80:80"
      - "443:443"
    configs:
      - source: nginx_template.conf
        target: /etc/nginx/nginx.conf
    secrets:
      - source: server.crt
        target: server.crt
      - source: server.key
        target: server.key
      - source: htpasswd
        target: .htpasswd
    deploy:
      placement:
        constraints: [node.role == manager]

...(continues)...

The first block defines the node stack. Here you can add as many container images as you like. For compatibility with AHUB it is only required that plumber (or any other API) publishes on port 8000 and provides the Swagger definition file (if you want to use the GUI functionality). The latter is achieved by calling plumber's $run method with the parameter swagger=TRUE, as sketched below.
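
For an R based node, this could look roughly as follows (a minimal sketch assuming a plumber definition file called plumber.R and the pre-1.0 plumber API, where the swagger argument is used):

# start_api.R - start the plumber API on port 8000 with the Swagger definition enabled
library(plumber)
pr <- plumber::plumb("plumber.R")
pr$run(host = "0.0.0.0", port = 8000, swagger = TRUE)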

IMPORTANT: If you want to add your own images, you need to make sure that they are hosted in a container registry, otherwise docker swarm will not be able to find them. Either use the public Docker Hub or set up a private registry with one of the cloud providers (Google, Azure or AWS).

The analytical nodes do not have to be R based. A python node running a combination of flask/flasgger would be compatible as well.

The second block constitutes the service stack and does not need to be changed if you stick to the basic scenario with self-signed certificates and basic authentication. Changes here are needed if you want to use more elaborate functionality like auto-refreshing Let's Encrypt certificates or Active Directory authentication. These use cases will be covered in future tutorials.

For now you can either leave the demo file as is or add/substitute your own container images in the node stack! Note: There is no need to configure nginx when adding containers in the node stack. This is all taken care of by AHUB.

Ramping up the swarm

Before we launch AHUB we need to prepare the docker daemon to run in swarm mode:

> docker swarm init
Swarm initialized: current node (XXX) is now a manager.

Then the whole stack can be launched by docker in swarm mode with the following command

docker stack deploy -c ./ahub.yaml mystack

This command references the Compose file ahub.yaml to deploy a stack called "mystack". Of course you can change the name of your stack to your liking.

You should see the following output on the shell:

> docker stack deploy -c ./ahub.yaml mystack

Creating network mystack_default
Creating secret mystack_server.key
Creating secret mystack_htpasswd
Creating secret mystack_server.crt
Creating config mystack_location_template.conf
Creating config mystack_nginx_template.conf
Creating config mystack_upstream_template.conf
Creating service mystack_portainer
Creating service mystack_node1
Creating service mystack_node2
Creating service mystack_node3
Creating service mystack_nginx
Creating service mystack_redis
Creating service mystack_boss
Creating service mystack_gui
Creating service mystack_updater
>

Here is a quick intro on the components of the service stack (the code and Dockerfiles for them are located in the subfolder ./modules):

  • boss: The central management service. Among other things, it handles the detection of the node stack and adjusts the nginx configuration accordingly.
  • nginx: Our reverse proxy handling authentication, encryption and endpoint routing
  • redis: NoSQL DB used as a process repository. It is of no real importance for this minimal setup; it comes into play when using the ahubr package and activating the advanced logging functionality of AHUB. I will dive into that in a subsequent tutorial.
  • gui: Provides a very basic GUI for interaction with the API endpoints. You can manually set parameters and view the output of an API call (only JSON output currently supported).
  • updater: Time triggering of certain activities, like node stack discovery or certificate renewal. Will be covered in future tutorials as well.
  • and finally portainer

Portainer is a very powerful browser-based container management tool. We can start checking whether everything ramped up fine by navigating to http://localhost:9000 (Portainer listens on this port). As you are starting this cluster for the first time, you need to set an admin account and then choose the "Local" mode. After that you get to the Portainer main page, where you can click through the items Services, Containers and whatever else piques your interest. With Portainer you can do almost anything you can do from the docker command line interface.

Under the Services tab you should see 9 services if you stuck to the demo file, three of them being the node stack comprising node1, node2 and node3. Everything is fine when you see a 1/1 behind each service.

Checking the API endpoints

You can now navigate to your endpoints via https://localhost/{nodename}/{endpoint}?{parameters}, for example https://localhost/node2/plot or https://localhost/node3/?n=24. Your browser will warn you about the insecure certificate (because it is self-signed; you can skip this warning) and ask for the user credentials.
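
For scripted checks, the same endpoints can also be called with curl (a sketch: username and password are the credentials generated earlier, -k skips verification of the self-signed certificate, and the msg parameter is taken from the standard plumber /echo example):

curl -k -u username:password "https://localhost/node2/echo?msg=hello"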

There is also a rudimentary GUI at https://localhost (still under development) showing you the various nodes and their endpoints so you can manually trigger a GET request for testing purposes.

Finally if you want to tear down the whole stack, you can do so with the following command

docker stack rm mystack

Summary

We were able to ramp up a set of analytical containers providing RESTful APIs by just changing a few lines in a Docker Compose file. The AHUB deployment framework takes care of common tasks like encryption, authentication and a graphical user interface to the endpoints. In the next installment of this tutorial series, I will show how to run a stack on a cloud-based virtual machine with a public DNS address and retrieve a proper SSL certificate from Let's Encrypt.

I am still looking for contributors to the AHUB project, especially a frontend developer. So if you are keen on ReactJS, give me a shout.

The post So you want to deploy multiple containers running different R models? appeared first on QUNIS.


Top 10 Database Diagram Tools for MySQL

Feed: Databasejournal.com – Feature Database Articles.
Author: .



Posted April 2, 2019

By Rob Gravelle

Your database schema acts as the blueprints for your database; its job is to represent the database structure, data types, and table constraints. Designing the database schema is one of the very first and most important steps in building a new database. Although not obligatory, employing specialized software does make the process much easier to accomplish. With that being said, here is a list of 10 free and commercial tools that you can use to design your MySQL databases, in no particular order.

  • Database Diagram Tool: QuickDBD

    QuickDBD is an online service that lets you define your schema in one pane and then generates a diagram accordingly. You can then save your diagram in a variety of formats as well as collaborate with others online, but you’ll need to create an account in order to do that. The free level gives you 1 public diagram and 10 tables. Paid plans are $7 for a week, $14/month, and $95/year for unlimited tables and private diagrams.

    They are currently offering free plans in exchange for publicity: just write a review on your blog or other website and get 1 year for FREE.

  • Database Diagram Tool: SQLDBM

    SqlDBM is another online service that offers an easy and convenient way to design your database in any browser, without the need for any extra database engine or database modelling tools or apps. SqlDBM can incorporate any needed database rules and objects such as database keys, schemas, indexes, column constraints and relationships. It also supports reverse engineering: you just have to import an existing database into SqlDBM and run the reverse-engineering process. No database credentials are required for this.

    It only supports MySQL and MS SQL at this time. There are three pricing packages, ranging from free to $51 per month for unlimited everything.

  • Database Diagram Tool: Navicat Premium

    It should be noted that some professional database clients have design capabilities built in. These are often superior to free specialized tools. Navicat Premium is a shining example of a product that offers superlative database modeling as part of a larger product. Its sophisticated database design and modeling tool lets you visualize and edit your databases using an editor that is as visually stunning as it is easy to use.

    Although it’s not free, many DBAs and database designers swear by Navicat.

  • Database Diagram Tool: Lucidchart

    Lucidchart is cloud-based, collaborative diagram software geared towards development teams. In that capacity, it helps you and your team create database diagrams, as well as flowcharts, process maps, UML models, org charts, etc., on any device and platform. Lucidchart is integrated with tools such as G Suite and Microsoft Office and is well suited to designers who want a drag-and-drop interface.

    They offer a Free plan that comes with a limit of 3 diagrams and 60 objects per diagram. Paid plans range from $4.95 to $20 per month.

  • Database Diagram Tool: MySQL Workbench

    Good ole MySQL Workbench provides DBAs and developers an integrated tools environment for a variety of tasks including:

    • Database Design & Modeling
    • SQL Development
    • Database Administration
    • Database Migration

    The free Community (OSS) Edition also provides extensive capabilities for creating and manipulating database models, such as the ability to:

    • Create and manipulate a model graphically.
    • Reverse engineer a live database to a model.
    • Forward engineer a model to a script or live database.
    • Create and edit tables and insert data.
  • Database Diagram Tool: Draw.io

    Draw.io is a free online diagram application for making flowcharts, process diagrams, etc. It supports several different types of charts such as flowcharts, org charts, UML, ER and network diagrams. You can save your diagrams to cloud storage services like GDrive, Dropbox and OneDrive, or to your own computer. Some users have stated that Draw.io’s visual interface is not as nice as that of similar products such as Lucidchart, but don’t let that stop you from trying it out for yourself.

  • Database Diagram Tool: dbForge Studio for MySQL

    dbForge Studio for MySQL is a universal GUI client for MySQL and MariaDB database development, management, and administration. The IDE allows you to create and execute queries, develop and debug stored routines, automate database object management, and analyze table data, among other things. Think of it as a commercial alternative to MySQL Workbench.

    The Database Designer can create, analyze, reverse-engineer, print and customize your MySQL databases.

    dbForge Studio comes in three editions: Standard Edition ($149.95), Professional Edition ($299.95), and Enterprise Edition ($399.95).

  • Database Diagram Tool: SQLGate

    SQLGate is an integrated database management and development solution that simplifies the construction and operation of databases. It provides quick data analysis functions like filter, sort and group on a grid, as well as Excel-like copy, paste, sum and statistics functions.

    There is a free version that is limited to 10 tables in the ERD designer. The $40/month SQLGate Suite includes SQLGate for Oracle, SQLServer, PostgreSQL, Tibero, MariaDB, and MySQL databases. You can also purchase one DB type for $15 – $25 per month, depending on the DB type.

  • Database Diagram Tool: Data Xtractor

    Data Xtractor is a free visual data modeling and inspection tool for the most popular relational databases. Its graphical interface supports auto-generated detailed, structural, relationship, simple, graph or topological models. Shapes can be expanded, collapsed, minified or made transparent. You can also chain relationships, custom joins and aliases to enhance your model without altering the database itself.

    You can virtually enhance your local data model with custom joins, chain relationships and name aliases, without actually changing anything on the database server. It also supports conceptual many-to-many relationships, the bypassing of hidden intersection tables, auto-generated relationship items, as well as the detection and expansion of relationships into connectors via simple drag and drop.

  • Database Diagram Tool: DbSchema

    DbSchema is a diagram-oriented database tool compatible with all relational and many NoSQL databases, including MySQL, Cassandra, PostgreSql, MongoDb, Redshift, SqlServer, Azure, Oracle, Teradata and more.

    Using DbSchema, you can design the schema with or without a database, save the design as a project file, deploy the schema on multiple databases and share the design within a team. Built-in tools allow you to visually explore the database data, build and execute queries, generate random data, build form applications or reports, and more.

    DbSchema can be evaluated for 15 days for free, with no registration required. It’s available in 3 flavors: Academic (for educational purposes) at $63, Personal (for individual developers, administrators) at $127, and Commercial (for companies) at $197.


Avengers Fan Frustration Highlights Importance of Database Technology

Feed: Blog Post – Corporate – DataStax.
Author: David Waugh, Senior Vice President of Market Development.

How do you know it’s time to upgrade your database technology?

When your fans can’t buy tickets to your latest blockbuster movie, for example—as recently happened with Avengers fans trying to buy tickets on Fandango and other sites—or buy the latest smartphone, or re-book a flight that got cancelled, or one of the many other ‘I have to have it NOW’ things that millions and millions of consumers do every day.

One of the clearest indicators of using obsolete database technology, and likely a poor data architecture holding things together, is millions of frustrated fans wishing they had other ways to purchase your product.

The part that is so frustrating to those of us in the technology industry is that this is all completely avoidable with modern NoSQL technology like Apache Cassandra™ and its enterprise version from DataStax.

Today’s data architectures need to accommodate millions of transactions in very short periods of time, and even more importantly (and harder), they need to be able to provide the data where and when their customers need it: locally at the endpoint device (think smartphones).

Here’s the problem: nobody knows where every consumer will be or exactly when they’ll engage to buy their movie ticket, or even more importantly, how long they’ll be willing to wait to get confirmation before they click away to another site. Because these consumers are highly distributed and have very short attention spans and demand instant confirmation for things, and because there are millions and millions of them, enterprises today need technology that can keep up, that can handle real-time demands at cloud scale.

And here’s the ironic part: The vendors of these obsolete legacy technologies keep saying that all you need is more hardware, more instances of the database, and more add-on technology to replicate copies of the data all over the place.

The problem is that this legacy technology was never intended for today’s loads and expectations. And yet, companies try to keep it running with more glue and tape (you know the kind…) and bubble gum. And when it all comes crashing down, it’s devastating to reputation and brand satisfaction, or Net Promoter Score (NPS). Recovering from that mass negative publicity takes years if you are lucky to recover at all.

The good news is that there’s a relatively easy solution to fix these issues—and even better, prevent them from even happening in the first place. How? By overhauling your data architecture with a highly scalable, always-on, active everywhere database that allows you to take full advantage of hybrid and multi-cloud computing environments.

Modern applications running in hybrid cloud—as long as they’re built and running on the right kind of database—don’t go down and don’t leave your customers waiting or wanting more. Period.

For a closer look into what you can do to avoid becoming the next headline about a negative customer experience, contact DataStax. We can help you not only avoid the negative brand experience but accelerate your growth and innovation in this data-driven, “I need it now” world.

eBook: Powering Growth and Innovation in a Hybrid Cloud World


Two more of my presentations

Feed: James Serra’s Blog.
Author: James Serra.

I recently made available two more presentations that you might find helpful.  Feel free to download them and present them to others (adding a line that you got them from me is all I ask).  There is a full list of all my presentations with slide decks here.  I continually update all my decks, but unfortunately SlideShare no longer allows you to update decks, so just email me at jamesserra3@gmail.com if you want the latest one.

Azure data platform overview

Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build. (slides)

Machine Learning and AI

The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases. (slides)

Accelerating Enterprise Application Migration to AWS Using Dynatrace

Feed: AWS Partner Network (APN) Blog.
Author: Ahmed Omran.

By Ahmed Omran, Migration Partner Solutions Architect at AWS
By Andreas Grabner, Director of Strategic Partner Enablement & Evangelism at Dynatrace

Dynatrace is an Amazon Web Services (AWS) migration partner and provides an artificial intelligence-powered platform which delivers full-stack, automated monitoring that goes beyond collecting data. It can help you address challenges in operations, DevOps, cloud migration, and customer experience.

In this post, we focus on how Dynatrace shaped their cloud migration and autonomous cloud operations capabilities through their own migration journey from legacy on-premises enterprise application to cloud-native services running on AWS.

Dynatrace’s own journey, and the journeys of several large enterprise customers they guided, highlights some of the pain points and critical actions necessary for a successful and fast enterprise application migration to AWS.

Dynatrace is an AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in Migration, DevOps, and Containers. The AWS Competency Program verifies, validates, and vets top APN Partners that have demonstrated customer success and deep specialization in specific solution areas or segments.

If you want to be successful in today’s complex IT environment, and remain that way tomorrow and into the future, teaming up with an AWS Competency Partner like Dynatrace is The Next Smart.

Getting Started

A typical migration project includes the following phases:

  • Discovery and planning
  • Migrating and validating application
  • Operation

Let’s walk through each phase and the corresponding tasks per phase, highlighting how Dynatrace becomes an accelerator and enabler for autonomous operations.

Phase 1: Discovery and Planning

One of the biggest challenges customers face in migration projects is not having enough details about the current environment. This hinders their ability to make the right decisions when planning what to migrate and when.

Dynatrace’s OneAgent technology supports on-premises as well as cloud stacks, providing an automated live dependency map called Dynatrace Smartscape. Figure 1 shows Smartscape and the dependencies of a selected service.

Figure 1 – Smartscape shows the dependencies of a selected service.

To get Smartscape, you must install OneAgent on the components you’re looking to discover and monitor. The agent can be automatically installed and rolled out through configuration management tools such as Chef, Puppet, Ansible, or by including the download and install into any custom deployment script.

OneAgent supports a wide range of features on different operating systems. Minutes after installation, the agent automatically delivers live dependency information across the complete stack: hosts, processes, containers, services, applications, network connections, and log files.

In Figure 1, Smartscape shows all the dependencies of the selected Tomcat service. Those include connections to queues, web servers, app servers, and a native process. This information allows us to better plan the migration, as all depending services must be considered during the migration.

Smartscape also shows whether services currently run distributed or are single instances. In our example, we can see the services are already running distributed across four processes on four machines.

Task #1: Application Prioritization for Migration

The risk of moving specific components (host, database, processes, services, application) depends on the complexity and interdependency to the rest of the environment architecture. For example, before migrating a database we can look at Smartscape to understand which services are actively using this database and in which capacity.

Knowing the type of access, executed statements, and amount of data transferred during regular hours of operation allows for better planning and prioritization of the move groups. In some cases, you may decide to not migrate this database in favor of other services or databases that are less complex to migrate due to fewer dependencies.

Task #2: Right-Sizing the AWS Environment

When deciding which resources are required to handle the migrated load, customers often look at basic monitoring data in regards to CPU, memory, disk, and network. Dynatrace provides this information in the context of the hosts, processes, and services that will be migrated based on the outcome of Task #1.

Through the Dynatrace Timeseries API, we can automate the mapping and sizing of target AWS compute resources. Figure 2 shows the current sizing and actual resource consumption of a Windows virtual machine (VM) over a period of a week. Extracting this data via the Dynatrace REST API allows you to plan your migration faster and more efficiently.

Figure 2 – Utilization and process details of a Windows Server 2012 R2 Citrix environment.

When moving this Windows VM, which primarily hosts Citrix, we can look at the current CPU utilization and find a best-fit Amazon Elastic Compute Cloud (Amazon EC2) instance. The size of the current VM seems to be too large for the actual resource consumption.

For more details on what infrastructure monitoring data is available for hosts, processes, containers, services, and applications, check out the Dynatrace Timeseries API documentation.
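
As a rough illustration, a single metric can be pulled with a plain HTTP call against the Timeseries API (a sketch only: the environment URL, API token, and timeseries id below are placeholders, and the exact endpoint and parameters are described in the documentation linked above):

curl -H "Authorization: Api-Token <your-api-token>" \
  "https://<your-environment>.live.dynatrace.com/api/v1/timeseries/com.dynatrace.builtin:host.cpu.user?relativeTime=week&aggregationType=AVG&includeData=true"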

Task #3: Breaking and Migrating Monolithic Applications

Re-hosting (also referred to as lift and shift) is a common migration use case. Re-architecture and re-platforming are steps that break traditional monolithic architectures and replace individual components with cloud services, such as Amazon Relational Database Service (Amazon RDS), which replaces on-premises relational databases, and Amazon DynamoDB, which replaces NoSQL databases. These steps can also replace components with newly-developed microservices, including containerized or serverless ones.

Dynatrace’s PurePath technology gives you insights into your current end-to-end architecture by tracing every single transaction through the monolithic or hybrid architecture. Not only does this data help you understand the current architecture, but it also enables continuous experimentation by virtually breaking out services and seeing how the new architecture would behave if the application were changed along certain interfaces, endpoints, classes, or methods.

The Service Flow in Figure 3 shows how transactions flow from Apache through the different layers of your monolithic application architecture. The data shown on the edges between physical or virtual services gives insights on inter-service call patterns.

Figure 3 – Service Flow of EasyTravel application.

In this example, we learn that JourneyService is tightly coupled with CheckDestination, as it calls CheckDestination in 99 percent of incoming processed requests. While migrating this application, it would be wise to keep JourneyService and CheckDestination in close proximity to avoid network latency effects.

If you want to learn more about the approach and features available in Dynatrace to virtually break your monolithic architecture, have a look at their 8-Step Recipe to Break Monoliths.

Phase 2: Migrating and Validating Application

While migrating to the cloud, you want to evaluate whether your migration is going according to plan, whether the services are still performing as well as or even better than before, and whether your new architecture is as efficient as the blueprint suggested. Dynatrace helps you validate all these steps automatically, which helps speed up the migration and validation process.

Task #1: Validate Migration Progress

Dynatrace allows you to use the same mechanisms (Smartscape, Service Flow) on AWS as you would on-premises to validate how the deployment turned out and if all dependencies are as expected. To install OneAgent on your Amazon EC2 instances, make sure to enable the Dynatrace AWS Integration which pulls in additional metrics and metadata from Amazon CloudWatch.

If the new architecture includes AWS Lambda, make sure to instrument your Lambda functions for end-to-end visibility. For AWS Fargate or Amazon Elastic Container Service for Kubernetes (Amazon EKS), make sure to dig into the no-touch container monitoring capabilities of Dynatrace.

The image in Figure 4 from Dynatrace Smartscape shows an example of the “easyTravel Customer Frontend” service running distributed across four Tomcat instances and four Amazon EC2 machines in a single Availability Zone (AZ). That makes it easy to validate whether this is the outcome you expected, or whether there’s a configuration or deployment mistake that has to be corrected.

Figure 4 – Smartscape shows the target environment deployment.

Task #2: Validate Performance and Scalability

Dynatrace helps you validate your primary motivations for moving to the AWS Cloud, be it increased agility or higher elasticity, by allowing you to compare the performance of your source and target environments.

The screenshots in Figure 5 show the comparison feature of Dynatrace. It compares Key Performance Indicators (KPIs) between the existing environment (left) and migrated environment (right). If there’s a significant difference between performance or failure rate, like in this example, you can drill into the actual methods that spend more time or analyze the exceptions that caused these errors.

Figure 5 – Performance validation between the source and target environment.

The screenshots above show us that while the system scales up with load, it also reaches a breaking point under peak load conditions. This analysis allows us to tweak the deployment and code, so we end up with a system that can scale without running into issues.

Dynatrace also gives you insights into how your architecture scales with increased load by observing the live dependency data (via Smartscape). It also looks at any artificial intelligence (AI)-detected problems under different load conditions.

Figure 6 – Smartscape output with different load condition.

Task #3: Validate and Optimize Cloud Architecture

When extracting services from your monolith, or replacing components with cloud services, the new architecture typically increases in complexity.

We can use Service Flow to validate how the architecture really looks after migration. Figure 7 clearly shows that after the migration the system became more complex. Based on this data, we can spot architectural patterns such as chatty neighbors, the N+1 query pattern, or recursive calls leading to inefficient transaction processing.

Figure 7 – Post-migration service flow for EasyTravel application.

Seeing the actual architecture and understanding the current limitations in the live Service Flow allows us to fix configuration issues or add new architectural components such as caching layers, proxies, or load balancers. All of this results in a faster migration with a better chance of business success.

Phase 3: Operation

Migrating to the cloud means there are more moving pieces to monitor and manage. As we have learned, Dynatrace fully automates monitoring through OneAgent and the native integration with AWS.

Post-migration, Dynatrace’s deterministic AI engine, DAVIS, helps you run your applications smoothly, pinpointing production issues with root-cause information right at your fingertips. DAVIS leverages Smartscape dependency data, as well as the high-fidelity monitoring data from OneAgent. The AI can be fed with external events such as deployment or configuration change events from your CI/CD or deployment automation tools.

DAVIS’ unique capabilities for automated root cause detection can be integrated with ChatOps, VoiceOps, or auto-remediation actions. This brings your operation teams closer to what we call Autonomous Operations and enables business teams to make better decisions based on monitoring data by using voice or chat commands.

Task #1: Reduce MTTR with Automated Root Cause Diagnostics

Dynatrace automatically baselines all relevant metrics that are indicators of bad service levels or poor end-user experience. Thanks to the extensive data set and deterministic AI, Dynatrace can open problem tickets with information about the business impact and technical root cause, correlating all events that happened in the architecture prior to and during the impact.

This automated root cause diagnosis dramatically reduces Mean Time To Repair (MTTR) through detailed data and automation and integration with incident response tools such as ServiceNow, PagerDuty, xMatters, VictorOps, or OpsGenie.

You can see in Figure 8 how Dynatrace presents a problem, business impact, impacted services, and the root cause.


Figure 8 – Dynatrace root cause diagnostics for multiple service problems.

The ticket above tells us that 556 real users are impacted while we have a total of 2.04 million backend service calls affected by an issue at the moment. The root cause has been identified as a network-related issue on two backend neo4j instances that experienced a high packet retransmission rate.

Task #2: Building Self-Healing Systems

Reacting to problems faster is great, but Dynatrace goes further and enables self-healing systems through the data collected and analyzed by DAVIS. The same data is accessible through the Dynatrace REST API and can be used by auto-remediation tools such as Ansible, ServiceNow, and StackStorm to execute specific remediation actions based on the actual root cause detected.

We can also trigger a Lambda function that interacts with both Dynatrace and AWS to fix a current problem.
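
As a rough illustration of that pattern, the sketch below shows an AWS Lambda handler (Python, boto3) that could receive a problem notification and reboot the affected instances. The payload field names (`ProblemTitle`, `ImpactedEntities`), the API Gateway-style `body` wrapper, and the tag-based mapping from entity names to EC2 instances are assumptions for illustration, not the exact Dynatrace webhook schema.

```python
import json
import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    """Hypothetical auto-remediation hook: reboot EC2 instances named in a
    problem notification. Payload field names below are assumptions."""
    problem = json.loads(event.get("body", "{}"))
    title = problem.get("ProblemTitle", "unknown problem")

    # Assume impacted entity names map to EC2 instances via a Name tag (illustrative).
    impacted = [e.get("name") for e in problem.get("ImpactedEntities", [])]
    instance_ids = []
    for name in impacted:
        resp = ec2.describe_instances(
            Filters=[{"Name": "tag:Name", "Values": [name]}]
        )
        for reservation in resp["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])

    if instance_ids:
        ec2.reboot_instances(InstanceIds=instance_ids)

    return {"statusCode": 200,
            "body": json.dumps({"remediated": instance_ids, "problem": title})}
```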

Task #3: Empower Teams Through VoiceOps and ChatOps

Dynatrace opens up data through integrations with VoiceOps and ChatOps solutions such as Slack or Alexa. It also supports Alexa push notifications, as demonstrated in this video.

To get started, visit the Dynatrace website and start interacting with the Dynatrace AI.

Summary

In this post, we covered how Dynatrace helps customers migrate to the AWS Cloud with confidence, break up monolithic apps, mitigate risks, validate the results, and operate at scale.

After installing Dynatrace OneAgent on your existing architecture, you get automated insights that allow you to better plan your migration strategy. You also get the ability to compare and optimize your environment, and once you have migrated to the cloud, Dynatrace is all about making operations more efficient and autonomous.

The best way to get started is by signing up for the Dynatrace SaaS Trial and exploring the material on Dynatrace’s blog as well as many YouTube tutorials.

VIDEO: AWS Migration Competency Partner Dynatrace (2:28)




Dynatrace – APN Partner Spotlight

Dynatrace is an AWS Competency Partner. Their AI-powered, full stack, and completely automated solution provides answers, not just data, based on deep insight into every user, every transaction, across every application.

Contact Dynatrace | Solution Overview | Buy on Marketplace

*Already worked with Dynatrace? Rate this Partner

*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.

How Skype modernized its backend infrastructure using Azure Cosmos DB – Part 2


Feed: Microsoft Azure Blog.
Author: Parul Matah.

This is a three-part blog post series about how organizations are using Azure Cosmos DB to meet real world needs, and the difference it’s making to them. In part 1, we explored the challenges Skype faced that led them to take action. In this post (part 2 of 3), we examine how Skype implemented Azure Cosmos DB to modernize its backend infrastructure. In part 3, we’ll cover the outcomes resulting from those efforts.

Note: Comments in italics/parenthesis are the author’s.

The solution

Putting data closer to users

Skype found the perfect fit in Azure Cosmos DB, the globally distributed NoSQL database service from Microsoft. It gave Skype everything needed for its new People Core Service (PCS), including turnkey global distribution and elastic scaling of throughput and storage, making it an ideal foundation for distributed apps like Skype that require extremely low latency at global scale.

Initial design decisions

Prototyping began in May 2017. Some early choices made by the team included the following:

  • Geo-replication: The team started by deploying Azure Cosmos DB in one Azure region, then used its pushbutton geo-replication to replicate it to a total of seven Azure regions: three in North America, two in Europe, and two in the Asia Pacific (APAC) region. However, it later turned out that a single presence in each of those three geographies was enough to meet all SLAs.
  • Consistency level: In setting up geo-replication, the team chose session consistency from among the five consistency levels supported by Azure Cosmos DB. (Session consistency is often ideal for scenarios where a device or user session is involved because it guarantees monotonic reads, monotonic writes, and read-your-own-writes.)
  • Partitioning: Skype chose UserID as the partition key, thereby ensuring that all data for each user would reside on the same physical partition. A physical partition size of 20GB was used instead of the default 10GB size because the larger number enabled more efficient allocation and usage of request units per second (RU/s)—a measure of pre-allocated, guaranteed database throughput. (With Azure Cosmos DB, each collection must have a partition key, which acts as a logical partition for the data and provides Azure Cosmos DB with a natural boundary for transparently distributing it internally, across physical partitions.) A minimal provisioning sketch illustrating these choices follows this list.
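
The sketch below uses the azure-cosmos Python SDK to mirror the session-consistency and /userId partition-key choices described above. The endpoint, key, database and container names, and throughput number are placeholders, not Skype's actual provisioning values.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Endpoint and key are placeholders; session consistency and the /userId
# partition key mirror the design choices described above.
client = CosmosClient(
    "https://<account>.documents.azure.com:443/",
    credential="<primary-key>",
    consistency_level="Session",
)

database = client.create_database_if_not_exists(id="pcs")
container = database.create_container_if_not_exists(
    id="people",
    partition_key=PartitionKey(path="/userId"),
    offer_throughput=1000,  # RU/s; illustrative, not Skype's actual allocation
)

# All documents carrying the same userId land in the same logical partition.
container.upsert_item({"id": "contact-1", "userId": "user-42", "displayName": "Sally"})
```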

Event-driven architecture based on Azure Cosmos DB change feed

In building the new PCS service, Skype developers implemented a micro-services, event-driven architecture based on change feed support in Azure Cosmos DB. Change feed works by “listening” to an Azure Cosmos DB container for any changes and outputting a sorted list of documents that were changed, in the order in which they were modified. The changes are persisted, can be processed asynchronously and incrementally, and the output can be distributed across one or more consumers for parallel processing. (Change Feed in Azure Cosmos DB is enabled by default for all accounts, and it does not incur any additional costs. You can use provisioned RU/s to read from the feed, just like any other operation in Azure Cosmos DB.)

“Generally, an event-driven architecture uses Kafka, Event Hub, or some other event source,” explains Kaduk. “But with Azure Cosmos DB, change feed provided a built-in event source that simplified our overall architecture.”

To meet the solution’s audit history requirements, developers implemented an event sourcing with capture state pattern. Instead of storing just the current state of the data in a domain, this pattern uses an append-only store to record the full series of actions taken on the data (the “event sourcing” part of the pattern), along with the mutated state (i.e. the “capture state”). The append-only store acts as the system of record and can be used to materialize domain objects. It also provides consistency for transactional data, and maintains full audit trails and history that can enable compensating actions.

Separate read and write paths and data models for optimal performance

Developers used the Command and Query Responsibility Segregation (CQRS) pattern together with the event sourcing pattern to implement separate write and read paths, interfaces, and data models, each tailored to their relevant tasks. “When CQRS is used with the Event Sourcing pattern, the store of events is the write model, and is the official source of information capturing what has happened or changed, what was the intention, and who was the originator,” explains Kaduk. “All of this is stored in one JSON document for each changed domain aggregate—user, person, and group. The read model provides materialized views that are optimized for querying and are stored in second, smaller JSON documents. This is all enabled by the Azure Cosmos DB document format and the ability to store different types of documents with different data structures within a single collection.” Find more information on using Event Sourcing together with CQRS.
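
To make the two models concrete, the illustrative shapes below show an append-only event document (the write model, with the captured state) and a smaller materialized read document that can live in the same collection. The field names are hypothetical, not Skype's actual schema.

```python
import uuid
from datetime import datetime, timezone

def new_event_doc(user_id: str, action: str, payload: dict, new_state: dict) -> dict:
    """Write model: an append-only record of what happened, plus the mutated
    state captured at that point (event sourcing with capture state)."""
    return {
        "id": str(uuid.uuid4()),
        "userId": user_id,             # partition key
        "docType": "event",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,              # the intention, e.g. "AddContact"
        "payload": payload,            # what changed
        "state": new_state,            # state captured after the change
    }

def new_read_doc(user_id: str, contacts: list) -> dict:
    """Read model: a small materialized view optimized for querying (CQRS)."""
    return {
        "id": f"view-{user_id}",
        "userId": user_id,             # same partition as the events
        "docType": "contactsView",
        "contacts": contacts,
    }

# Example: record adding Sally to John's address book, then materialize the view.
event = new_event_doc("user-john", "AddContact",
                      {"contactId": "user-sally"},
                      {"contacts": ["user-sally"]})
view = new_read_doc("user-john", ["user-sally"])
```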

Custom change feed processing

Instead of using Azure Functions to handle change feed processing, the development team chose to implement its own change feed processing using the Azure Cosmos DB change feed processor library—the same code used internally by Azure Functions. This gave developers more granular control over change feed processing, including the ability to implement retrying over queues, dead-letter event support, and deeper monitoring. The custom change feed processors run on Azure Virtual Machines (VMs) under the “PaaS v1” model.

“Using the change feed processor library gave us superior control in ensuring all SLAs were met,” explains Kaduk. “For example, with Azure Functions, a function can either fail or spin-and-wait while it retries. We can’t afford to spin-and-wait, so we used the change feed processor library to implement a queue that retries periodically and, if still unsuccessful after a day or two, sends the request to a ‘dead letter collection’ for review. We also implemented extensive monitoring—such as how fast requests are processed, which nodes are processing them, and estimated work remaining for each partition.” (See Frantisek’s blog article for a deeper dive into how all this works.)
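
The retry and dead-letter behavior Kaduk describes is independent of any particular SDK. A rough, library-agnostic sketch of that handling logic might look like the following, where the `process` callback, the in-memory queue, and the dead-letter list are stand-ins for Skype's actual components, and the time limits are illustrative.

```python
import time
from collections import deque

MAX_AGE_SECONDS = 2 * 24 * 60 * 60   # give up after roughly two days, as described above
retry_queue = deque()                # (document, first_seen_epoch_seconds)
dead_letters = []                    # stand-in for a "dead letter collection"

def handle_change(doc: dict, process) -> None:
    """Try to process one changed document; queue it for a later retry on failure."""
    try:
        process(doc)
    except Exception:
        retry_queue.append((doc, time.time()))

def drain_retries(process) -> None:
    """Periodically re-attempt queued documents; dead-letter the ones that are too old."""
    for _ in range(len(retry_queue)):
        doc, first_seen = retry_queue.popleft()
        if time.time() - first_seen > MAX_AGE_SECONDS:
            dead_letters.append(doc)               # park for manual review
            continue
        try:
            process(doc)
        except Exception:
            retry_queue.append((doc, first_seen))  # keep it for the next pass
```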

Cross-partition transactions and integration with other services

Change feed also provided a foundation for implementing background post-processing, such as cross-partition transactions that span the data of more than one user. The case of John blocking Sally from sending him messages is a good example. The system accepts the command from user John to block user Sally, upon which the request is validated and dispatched to the appropriate handler, which stores the event history and updates the queryable data for user John. A postprocessor responsible for cross-partition transactions monitors the change feed, copying the information that John blocked Sally into the data for Sally (which likely resides in a different partition) as a reverse block. This information is used for determining the relationship between peers. (More information on this pattern can be found in the article, “Life beyond Distributed Transactions: an Apostate’s Opinion.”)
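
A rough sketch of that cross-partition postprocessor is shown below. It assumes `container` is an azure-cosmos container proxy with an `upsert_item` method, and the document fields (`action`, `actorId`, `targetId`) are hypothetical, not Skype's actual schema.

```python
def apply_reverse_block(container, event_doc: dict) -> None:
    """Given a 'BlockUser' event read from the change feed on John's partition,
    write the corresponding reverse-block marker into Sally's partition."""
    if event_doc.get("action") != "BlockUser":
        return
    blocker = event_doc["actorId"]    # e.g. "user-john"
    blocked = event_doc["targetId"]   # e.g. "user-sally"

    container.upsert_item({
        "id": f"blockedBy-{blocker}",
        "userId": blocked,            # partition key value: lands in Sally's partition
        "docType": "reverseBlock",
        "blockedBy": blocker,
    })
```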

Similarly, developers used change feed to support integration with other services, such as notification, graph search, and chat. The event is received in the background by all running change feed processors, one of which is responsible for publishing a notification to external event consumers, such as Azure Event Hub, using a public schema.

Azure Cosmos DB flowchart

Migration of user data

To facilitate the migration of user data from SQL Server to Azure Cosmos DB, developers wrote a service that iterated over all the user data in the old PCS service to:

  • Query the data in SQL Server and transform it into the new data models for Azure Cosmos DB.
  • Insert the data into Azure Cosmos DB and mark the user’s address book as mastered in the new database.
  • Update a lookup table for the migration status of each user.

To make the entire process seamless to users, developers also implemented a proxy service that checked the migration status in the lookup table for a user and routed requests to the appropriate data store, old or new. After all users were migrated, the old PCS service, the lookup table, and the temporary proxy service were removed from production.
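
A simplified sketch of that routing logic follows, with the lookup table and the two service clients as stand-ins for the real components; the function and field names are hypothetical.

```python
MIGRATED = "migrated"

def get_address_book(user_id: str, lookup_table: dict, old_pcs, new_pcs):
    """Route a read to the old or new People Core Service based on the
    per-user migration status recorded in the lookup table."""
    status = lookup_table.get(user_id)
    if status == MIGRATED:
        return new_pcs.get_address_book(user_id)   # served from Azure Cosmos DB
    return old_pcs.get_address_book(user_id)       # still served from SQL Server

# Once every user shows as migrated, this proxy and the lookup table can be retired.
```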

Migration for production flowchart

Migration for production users began in October 2017 and took approximately two months. Today, all requests are processed by Azure Cosmos DB, which contains more than 140 terabytes of data in each of the replicated regions. The new PCS service processes up to 15,000 reads and 6,000 writes per second, consuming between 2.5 million and 3 million RUs per second across all replicas. A process monitors RU usage, automatically scaling allocated RUs up or down as needed.

Continue on to part 3, which covers the outcomes resulting from Skype’s implementation of Azure Cosmos DB.


Why a data scientist is not a data engineer


Feed: All – O’Reilly Media.
Author: Jesse Anderson.

Comparing apples and oranges

(source: frankieleon on Flickr)

“A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him.”

–Gordon Lindsay Glegg, The Design of Design (1969)

A few months ago, I wrote about the differences between data engineers and data scientists. I talked about their skills and common starting points.

An interesting thing happened: the data scientists started pushing back, arguing that they are, in fact, as skilled as data engineers at data engineering. That was interesting because the data engineers didn’t push back saying they’re data scientists.

So, I’ve spent the past few months gathering data and observing the behaviors of data scientists in their natural habitat. This post will offer more information about why a data scientist is not a data engineer.

Why does this even matter?

Some people complained that this data scientist versus data engineer is a mere focus on titles. “Titles shouldn’t hold people back from learning or doing new things,” they argued. I agree; learn as much as you can. Just know that your learning may only scratch the surface of what’s necessary to put something in production. Otherwise, this leads to failure with big data projects.

It’s also feeding into the management level at companies. They’re hiring data scientists expecting them to be data engineers.

I’ve heard this same story from a number of companies. They all play out the same: a company decides that data science is the way to get VC money, tons of ROI, mad street cred in their business circle, or some other reasons. This decision happens at C-level or VP-level. Let’s call this C-level person Alice.

The company goes on an exhaustive search to find the best data scientist ever. Let’s call this data scientist Bob.

It’s Bob’s first day. Alice comes up to Bob and excitedly tells him about all the projects she has in mind.

“That’s great. Where are these data pipelines and where is your Spark cluster?” Bob asks.

Alice responds, “That’s what we’re expecting you to do. We hired you to do data science.”

“I don’t know how to do any of that,” says Bob.

Alice looks at him quizzically, “But you’re a data scientist. Right? This is what you do.”

“No, I use the data pipelines and data products that are already created.”

Alice goes back to her office to figure out what happened. She stares at overly simplistic diagrams like the one shown in Figure 1 and can’t figure out why Bob can’t do the simple big data tasks.

Figure 1. Overly simplistic venn diagram with data scientists and data engineers. Illustration by Jesse Anderson, used with permission.

The limelight

There are two questions that come out of these interactions:

  • Why doesn’t management understand that data scientists aren’t data engineers?
  • Why do some data scientists think they’re data engineers?

I’ll start with the management side. Later on, we’ll talk about the data scientists themselves.

Let’s face it. Data engineering is not in the limelight. It isn’t being proclaimed as the best job of the 21st century. It isn’t getting all of the media buzz. Conferences aren’t telling CxOs about the virtues of data engineering. If you only look at the cursory message, it’s all about data science and hiring data scientists.

This is starting to change. We have conferences on data engineering. There is a gradual recognition of the need for data engineering. I’m hoping pieces like this one shed light on this necessity. I’m hoping my body of work will educate organizations on this critical need.

Recognition and appreciation

Even when organizations have data science and data engineering teams, there is still a lack of appreciation for the work that went into the data engineering side.

You even see this lack of credit during conference talks. The data scientist is talking about what they’ve created. I can see the extensive data engineering that went into their model, but it’s never called out during the talk. I don’t expect the talk to cover it in detail, but it would be nice to acknowledge the work that went into enabling their creation. Management and beginners to data science perceive that everything was possible with the data scientist’s skill set.

How to get appreciation

Lately, I’ve been getting questions from data engineers on how to get into their company’s limelight. They’re feeling that when a data scientist goes to show their latest creation, they’re either taking all of the credit or they’re given all of the credit by the management. Their basic question is: “How can I get the data scientists to stop taking credit for something that was both of our work?”

That’s a valid question from what I’m seeing at companies. Management doesn’t realize (and it isn’t socialized) the data engineering work that goes into all things data science. If you’re reading this and you’re thinking:

  • My data scientists are data engineers
  • My data scientists are creating really complicated data pipelines
  • Jesse must not know what he’s talking about

…you probably have a data engineer in the background who isn’t getting any limelight.

Similar to when data scientists quit without a data engineer, data engineers who don’t get recognition and appreciation will quit. Don’t kid yourself; there’s an equally hot job market for qualified data engineers as there is for data scientists.

Data science only happens with a little help from our friends

Figure 2. Even the Italians knew the importance of data engineers in the 1400s. Image from the Met Museum, public domain.

You might have heard about the myth of Atlas. He was punished by having to hold up the world/sky/celestial spheres. The earth only exists in its current form because Atlas holds it up.

In a similar way, data engineers hold up the world of data science. There isn’t much thought or credit that goes to the person holding up the world, but there should be. All levels of an organization should understand that data science is only enabled through the work of the data engineering team.

Data scientists aren’t data engineers

That brings us to why data scientists think they’re data engineers.

A few caveats to head off comments before we continue:

  • I think data scientists are really smart, and I enjoy working with them.
  • I’m wondering if this intelligence causes a higher IQ Dunning-Kruger effect.
  • Some of the best data engineers I’ve known have been data scientists, though this number is very small.
  • There is a consistent overestimation when assessing our own skills.
Figure 3. Empirical diagram of data scientists’ perceived data engineering skills versus their actual skills. Illustration by Jesse Anderson, used with permission.

In talking to data scientists about their data engineering skills, I've found their self-assessments to vary wildly. It's an interesting social experiment in biases. Most data scientists overestimated their own data engineering abilities. Some gave an accurate assessment, but none of them gave a lower assessment than their actual ability.

There are two things missing from this diagram:

  • What is the skill level of data engineers?
  • What is the skill level needed for a moderately complicated data pipeline?
Figure 4. Empirical diagram of data scientists’ and data engineers’ data engineering skills with the skill needed to create a moderately complicated data pipeline. Illustration by Jesse Anderson, used with permission.

From this figure, you can start to see the differences in the required data engineering abilities. In fact, I’m being more generous with the number of data scientists able to create a moderately complicated data pipeline. The reality may be that data scientists should be half of what the diagram shows.

Overall, it shows the approximate portions of the two groups who can and cannot create data pipelines. Yes, some data engineers can’t create a moderately complicated data pipeline. Conversely, most data scientists can’t, either. This comes back to the business issue at hand: organizations are giving their big data projects to individuals who lack the ability to succeed with the project.

You might think, “Good, so 20% of my data scientists can actually do this. I don’t need a data engineer after all.” First, remember this chart is being charitable in showing data scientists’ abilities. Remember that moderately complicated is still a pretty low bar. I need to create another diagram to show how few data scientists can handle the next step up in complexity. This is where the percentage drops to 1% or less of data scientists.

Why aren’t data scientists data engineers?

Sometimes I prefer to see the reflected manifestations of problems. These are a few examples of the manifested problems that stem from data scientists lacking the data engineering skill set.

University and other courses

Data science is the hot new course out there for universities and online courses. There are all sorts of offerings out there, but virtually all of them have the same problem: they either completely lack or have one data engineering class.

When I see a new university's data science curriculum announced, I take a look at it. Sometimes, I'll be asked for comments on a university's proposed data science curriculum. I give them the same feedback: "Are you expecting expert programmers? Because there isn't any coverage of the programming or systems required to even consume a data pipeline that's been created."

The course outlines generally focus on the statistics and math required. This reflects what companies and academics think data science should look like. The real world looks rather different. The poor students are left to fend for themselves for the rest of these non-trivial learnings.

We can take a step back and look at this academically by looking at course requirements for a master’s degree in distributed systems. Obviously, a data scientist doesn’t need this level of depth, but it helps show what’s missing and the big holes in a data scientist’s skill set. There are some major deficiencies.

Data engineering != Spark

A common misconception from data scientists—and management—is that data engineering is just writing some Spark code to process a file. Spark is a good solution for batch compute, but it isn’t the only technology you’ll need. A big data solution will require 10-30 different technologies all working together.

This sort of thinking lies at the heart of big data failures. Management thinks they have a new silver bullet to kill all of their big data problems. The reality is far more complicated than that.

When I mentor an organization on big data, I check for this misconception at all layers of the organization. If it does exist, I make sure I talk about all of the technologies they’ll need. This removes the misconception that there’s an easy button in big data and there’s a single technology to solve all of it.

Where is the code from?

Sometimes data scientists will tell me how easy data engineering is. I'll get them to tell me how and why they think that. "I can get all the code I need from StackOverflow or Reddit. If I need to create something from scratch, I can copy someone's design in a conference talk or a whitepaper."

To the non-engineer, this might seem OK. To the engineer, this sets off major alarm bells. The legal issues aside, this isn't engineering. There are very few cookie-cutter problems in big data. Everything after "hello world" has more complexity that needs a data engineer, because there isn't a cookie-cutter approach to dealing with it. Copying your design from a white paper could lead to a poorly performing design, or worse.

I’ve dealt with a few data science teams who’ve tried this monkey-see-monkey-do approach. It doesn’t work well. This is due to big data’s spike in complexity and the extreme focus on use cases. The data science team will often drop the project as it exceeds their data engineering abilities.

Put simply, there’s a big difference between “I can copy code from stackoverflow” or “I can modify something that’s already been written” and “I can create this system from scratch.”

Personally, I’m worried that data science teams are going to be these sources of massive technical debt that squelches big data productivity in organizations. By the time it’s found out, the technical debt will be so high it might be infeasible to correct it.

What’s the longest their code has been in production?

A core difference for data scientists is their depth. This depth is shown in two ways. What's the longest time their code has been in production—or has it ever been in production? What is the longest, largest, or most complicated program they have ever written?

This isn’t about gamesmanship or who’s better; it’s showing if they know what happens when you put something in production and how to maintain code. Writing a 20-line program is comparatively easy. Writing 1,000 lines of code that’s maintainable and coherent is another situation all together. People who’ve never written more than 20 lines don’t understand the miles of difference in maintainability. All of their complaints about Java verbosity or why programming best practices need to be used come into focus with large software projects.

Moving fast and breaking things works well when evaluating and discovering data. Working with code that goes into production requires a different and more intense level of rigor. It's for reasons like these that most data scientists' code gets rewritten before it goes into production.

When they design a distributed system

One way to know the difference between data scientists and data engineers is to see what happens when they write their own distributed systems. A data scientist will write one that is very math focused but performs terribly. A software engineer with a specialization in writing distributed systems will create one that performs well and is distributed (but seriously don’t write your own). I’ll share a few stories of my interactions with organizations where data scientists created a distributed system.

A business unit that was made up of data scientists at my customer’s company created a distributed system. I was sent in to talk to them and get an understanding of why they created their own system and what it could do. They were doing (distributed) image processing.

I started out by asking them why they created their own distributed system. They responded that it wasn’t possible to distribute the algorithm. To validate their findings, they contracted another data scientist with a specialty in image processing. The data scientist contractor confirmed that it wasn’t possible to distribute the algorithm.

In the two hours I spent with the team, it was clear that the algorithm could be distributed on a general-purpose compute engine, like Spark. It was also clear that the distributed system they wrote wouldn’t scale and had serious design flaws. By having another data scientist validate their findings instead of a qualified data engineer, they had another novice programmer validate their novice findings.

At another company run by mathematicians, they told me about the distributed system they wrote. It was written so that math problems could be run on other computers. A few things were clear after talking to them. They could have used a general-purpose compute engine and been better off. The way they were distributing and running jobs was inefficient. It was taking longer to do the RPC network traffic than it was to perform the calculation.

There are commonalities to all of these stories and others I didn’t tell:

  • Data scientists focus on the math instead of the system. The system is there to run math instead of running math efficiently.
  • Data engineers know the tricks that aren’t math. We’re not trying to cancel out infinities.
  • A data scientist asks, “how can I get a computer to do my math problems?” A data engineer asks, “how can I get a computer to do my math problems as fast and efficiently as possible?”
  • The organizations could have saved themselves time, money, and heartache by using a general-purpose engine instead of writing their own.

What’s the difference?

You’ve made it this far and I hope I’ve convinced you: data scientists are not data engineers. But really, what difference does all of this make?

The difference between a data scientist and a data engineer is the difference between an organization succeeding or failing in their big data project.

Data science from an engineering perspective

When I first started to work with data scientists, I was surprised at how little they begged, borrowed, and stole from the engineering side. On the engineering front, we have some well-established best practices that weren’t being used on the data science side. A few of these are:

  • Source control
  • Continuous integration
  • Project management frameworks like Agile or Scrum
  • IDEs
  • Bug tracking
  • Code reviews
  • Commenting code

You saw me offhandedly mention the technical debt I’ve seen in data science teams. Let me elaborate on why I’m so worried about this. When I start pushing on a data science team to use best practices, I get two answers: “we know and we’re going to implement that later” or “we don’t need these heavyweight engineering practices. We’re agile and nimble. These models won’t go into production yet.” The best practices never get implemented and that model goes straight into production. Each one of these issues leads to a compounding of technical debt.

Code quality

Would you put your intern's code into production? If you're in management, go ask your VP of engineering if they'll put a second-year computer science student's code into production. You might get a vehement no. Or they might say yes, but only after the code has been reviewed by other members of the team.

Are you going to put your data scientist’s code into production? Part of the thrust of this article is that data scientists are often novices at programming—at best—and their code is going into production. Take a look back up at the best practices that data science teams aren’t doing. There are no checks and balances to keep amateur code from going into production.

Why did they get good?

I want to end this by addressing the people who are still thinking their data scientists are data engineers. Or those data scientists who are also qualified data engineers. I want to restate that you can see from the figure it is possible, just not probable.

If this is true, I’d like you to think about why this happened.

In my experience, this happens when the ratio of data scientists to data engineers is well out of alignment: the ratio is inverted, or there are zero data engineers in the organization at all. There should be more like two to five data engineers per data scientist. This ratio is needed because more time goes into the data engineering side than the data science side.

When teams lack the right ratio, they're making poor use of their data scientists' time. Data scientists tend to get stuck on the programming parts that data engineers are proficient in. I've seen too many data scientists spend days on something that would take a data engineer an hour. Because the problem is perceived and solved incorrectly, organizations hire more data scientists instead of hiring the right people who would make the process more efficient.

Other times, they’re misunderstanding what a data engineer is. Having unqualified or the wrong type of data engineer is just as bad. You need to make sure you’re getting qualified help. This leads to the fallacy that you don’t need a data engineer because the ones you’ve worked with aren’t competent.

I'm often asked by management how they should get their data scientists to be more technically proficient. I respond that this is more a question of whether the data scientists should become more technically proficient. This is important for several reasons:

  • There’s a low point of diminishing returns for a data science team that isn’t very technical to begin with. They can study for months, but may never get much better.
  • It assumes that a data scientist is a data engineer and that isn’t correct. It would be better to target the one or two individuals on the data science team with the innate abilities to get better.
  • Is there an ROI to this improvement? If the data science team gets better, what could it do better or different?
  • It assumes the highest value is to improve the data science team. The better investment may be in improving the data engineering team and facilitating better communication and relations between the data science and data engineering teams.
  • It assumes that the data scientists actually want to improve technically. I’ve found that data scientists consider data engineering a means to an end. By doing the data engineering work, they get to do the fun data science stuff.

What should we do?

Given that a data scientist is not a data engineer, what should we do? First and foremost, we have to understand what data scientists and data engineers do. We have to realize this isn’t a focus on titles and limiting people based on that. This is about a fundamental difference in what each person is good at and their core strengths.

Having a data scientist do data engineering tasks is fraught with failure. Conversely, having a data engineer do data science is fraught with failure. If your organization is trying to do data science, you need both people. Each person fulfills a complementary and necessary role.

For larger organizations, you will start to see the need for people who split the difference between the data scientist and data engineer skill sets. I recommend the management team look at creating a machine learning engineer title and hiring for it.

Success with big data

As you’ve seen here, the path to success with big data isn’t just technical—there are critical management parts. Misunderstanding the nature of data scientists and data engineers is just one of those. If you’re having trouble with your big data projects, don’t just look for technical reasons. The underlying issue may be a management or team failure.

As you’re doing a root cause analysis of why a big data project stalled or failed, don’t just look at or blame the technology. Also, don’t just take the data science team’s explanation because they may not have enough experience to know or understand why it failed. Instead, you’ll need to go deeper—and often more painfully—to look at the management or team failings that led to a project failure.

Failures like these form a repeating and continuous pattern. You can move to the newest technology, but that won't fix the systemic issues. Only by fixing the root issue can you start to be successful.

Hazelcast IMDG 3.12 is Released


Feed: Blog – Hazelcast.
Author: David Brimley.

Hazelcast IMDG 3.12 is Released

We are pleased to announce the production-ready release of Hazelcast IMDG 3.12.

We’ve crushed a lot of bugs, provided general performance improvements, plus we’ve added some great new features (more of which below).

The release by numbers:

  • 676 Issues
  • 784 Pull Requests
  • 47 Committers
  • 168 Days Elapsed

CP Subsystem

The new CP Subsystem provides implementations of Hazelcast's existing concurrency APIs (locks, atomics, semaphores, latches) using a system based on the Raft Consensus Algorithm. Correct operation of concurrent data structures is a must, especially during network failures. This new feature provides a level of consistency not available in other IMDGs and many NoSQL products. As the name of the module implies, these implementations are CP with respect to the CAP principle and live alongside AP data structures in the same Hazelcast IMDG cluster. They maintain linearizability in all cases, including client and server failures and network partitions, and they prevent split-brain situations. The release also reiterates and clarifies the distributed execution and failure semantics of these APIs, and brings multiple improvements at the API level. Last, we introduce a new API, FencedLock, that extends the semantics of java.util.concurrent.locks.Lock to cover various failure models faced in distributed environments.

Hazelcast is the first and only IMDG that offers a linearizable and distributed implementation of the Java concurrency primitives backed by the Raft consensus algorithm.

You can find a more in-depth blog post on this new feature, written by one of the engineers who built it, here.

JSON Support

Hazelcast now recognises JSON data structures when saved as a value to an IMap using the new HazelcastJsonValue type. Once saved, all standard operations can be carried out, such as Predicate Queries and Aggregations. Hazelcast support for in-memory JSON storage provides a 400% increase in throughput when compared to other NoSQL Document Stores. JSON support will be available for all Hazelcast Clients including Java, C#, C++, Python, NodeJS & Go.

Advanced Network Configuration

Before IMDG 3.12, all network communications to a cluster member were handled via the same listener configuration. Now, unique network configurations can be applied to client-member, member-member, WAN replication, REST, and Memcache endpoints. Practically, this means that different public addresses and ports can be used, as well as different TLS certificates and TCP/IP configurations per service.

YAML Configuration Support

Cluster and Client configuration can now be expressed in YAML, as well as existing XML, Spring Beans or API Config.

First-Class Support for Kubernetes

A first-class experience for users of Hazelcast in Kubernetes is one of the primary goals of this release. We've now made it easier to connect Hazelcast clients that sit outside the K8s network. Previously, clients could only connect via a single member, which added an extra hop and additional latency. Now smart clients outside of the K8s network can connect directly to all members of the Hazelcast cluster, and therefore perform 1-hop operations on data. There are also new improvements around the placement of backups within the Hazelcast cluster: assuming correct labeling of the K8s cluster members, Hazelcast can now place backups on different physical hardware, ensuring no data loss when a physical machine in the K8s cluster crashes. Finally, extensive verification work has been carried out to ensure that Hazelcast Enterprise features work smoothly within K8s.

Blue/Green Deployments (Enterprise Only)

Blue/Green Deployments reduce downtime and risk by running two identical Hazelcast® Enterprise IMDG clusters called Blue and Green. One of the clusters provides production services to clients while the other cluster is upgraded with the new application code. Hazelcast Blue/Green Deployment functionality allows clients to be migrated from one cluster to another without client interaction or cluster downtime. All clients of a cluster may be migrated, or groups of clients can be moved with the use of label filtering and black/white lists.

Automatic Disaster Recovery Fail-over (Enterprise Only)

Automatic disaster recovery fail-over provides clients connected to Hazelcast® Enterprise IMDG clusters with the ability to automatically fail-over to a Disaster Recovery cluster should there be an unplanned outage of the primary production Hazelcast cluster.

Add and Remove WAN Publishers (Enterprise Only)

WAN Publishers can now be added and removed in a running cluster via a new REST API.

WAN Replication Performance Improvements (Enterprise Only)

During lab benchmarks, optimisations to the WAN service now show a 300% increase in throughput and average latencies are reduced by 600%.

Java 8

Java 8 is now the minimum required run-time SDK for Hazelcast.

Cloud Discovery

The AWS discovery plugin has been a part of the main Hazelcast distribution jar for some time now, in this release we now add Kubernetes and Google Cloud Platform (GCP). Previously, Kubernetes and GCP discovery were available as separate downloadable plugins.

Ingestion Pipeline Convenience

A new convenience API for rapid population of the cluster is now available. The Pipelining API manages multiple async ingests.

Client Labelling and Naming Improvements

Clients can now pass labels and instance names to the cluster.

Closing Words

Please download Hazelcast IMDG 3.12 and try it out. We want to hear what you think, so please stop by our Gitter channel and Google Group. If you would like to introduce some changes or contribute, please take a look at our GitHub repository.

What Makes a Database Cloud-Native?


Feed: MemSQL Blog.
Author: Floyd Smith.

MemSQL has been designed and developed as a distributed relational database, bringing the effectiveness of the relational database model into the new world of the cloud, containers, and other software-defined infrastructure – as described in a new report from 451 Research. Today, most of our customers run our software using some combination of the cloud and containers, with many also running it on-premises.

Today, we are purveyors of the leading platform-independent NewSQL database. Having recently joined the Cloud Native Computing Foundation, we'd like to take this opportunity to answer the question: "What makes a database cloud-native?"

Cloud-Native Software Definition

There are many definitions of "cloud-native software" available. 451 Research states that cloud-native software is "designed from the ground up to take advantage of cloud computing architectures and automated environments, and to leverage API-driven provisioning, auto-scaling and other operational functions."

The company continues: “Cloud-native architecture and software include applications that have been redesigned to take advantage of cloud computing architectures, but are not limited to cloud applications – we see cloud-native technologies and practices present in on-premises environments in the enterprise.”

The point is repeated in one of the major headings in the report: “Cloud-native isn’t only in the cloud.” 451 Research commonly finds cloud-native technologies and practices being used in on-premises environments.

What Cloud-Native Means for MemSQL

Let’s break down the 451 Research definition of cloud-native and see how it applies to MemSQL.

Takes Advantage of Cloud Features

The first point from the 451 Research report states that cloud-native software is “designed from the ground up to take advantage of cloud computing architectures and automated environments”.

MemSQL has been available on the major public cloud platforms for years, and deployments are balanced across cloud and on-premises environments. More importantly, MemSQL's unique internal architecture gives it both the scalability that is inherent to the cloud and the ability to support SQL for transactions and analytics.

An important step has been qualifying MemSQL for use in containers. MemSQL has been running in containers for a long time, and we use a containerized environment for testing our software.

451 Research shows a spectrum of cloud-native software services.

Leverages Software Automation

The report then goes into more detail on this point. Cloud-native software will “leverage API-driven provisioning, auto-scaling and other operational functions.” The ultimate goal here is software-defined infrastructure, in which the software stack is platform-independent and can be managed automatically, by other software.

MemSQL has command-line tools that integrate easily with on-premises deployment tools, such as Ansible, Chef, and Puppet, and cloud deployment mechanisms such as Azure Resource Management and CloudFormation. This is crucial to the definition and nature of cloud-native, and MemSQL’s automatability is crucial to its inclusion as cloud-native software.

MemSQL Studio provides a monitoring environment for MemSQL across deployment platforms – that is, across public cloud providers, private cloud, and on-premises.

Not Limited to Cloud Applications

Concluding their key points, 451 Research then states: “Cloud-native architecture and software include applications that have been redesigned to take advantage of cloud computing architectures, but are not limited to cloud applications… .”

The point here is that “cloud-native” doesn’t mean “cloud-only”. Cloud-native describes a set of capabilities that can be deployed anywhere — in public cloud providers, in modernized data centers, and increasingly at the edge.

The cloud-native movement combines with the unique features of MemSQL to create something really exceptional: a database that can leverage different deployment locations with ease. Flexibility and portability are creating a capability that hasn’t been available before.

Specific MemSQL features make it particularly suitable for cloud-native deployments:

  • Container-friendly. As mentioned above, MemSQL runs well in containers – which is a defining characteristic for cloud-native software.
  • Fully scalable. Like NoSQL databases, and unlike traditional relational databases, MemSQL is fully scalable within a cloud, on-premises, or across clouds and on-prem.
  • Kafka and Spark integration. Apache Kafka and Apache Spark are widely used for data transfer in cloud-native applications, and both work very smoothly with MemSQL Pipelines.
  • Microservices support. MemSQL’s performance, scalability, and flexibility are useful in microservices implementations, considered emblematic of cloud-native software.

Next Steps

MemSQL’s architecture and capabilities are unique and allow for unbeatable performance and effortless scale — especially when paired with elastic cloud infrastructure. An example is customers who want to move from on-premises Oracle deployments to cloud-native technologies. MemSQL improves on Oracle’s performance and reduces cost while modernizing data infrastructure.

Try MemSQL for free today, or contact us to learn how we can help support your cloud adoption plans.

Pre-Modern Databases: OLTP, OLAP, and NoSQL


Feed: MemSQL Blog.
Author: Rick Negrin.

In this blog post, the first in a two-part series, I’m going to describe pre-modern databases: traditional relational databases, which support SQL but don’t scale out, and NoSQL databases, which scale out but don’t support SQL. In the next part, I’m going to talk about modern databases – which scale out, and which do support SQL – and how they are well suited for an important new workload: operational analytics.

In the Beginning: OLTP

Online transaction processing (OLTP) emerged several decades ago as a way to enable database customers to create an interactive experience for users, powered by back-end systems. Prior to the existence of OLTP, a customer would perform an activity. Only at some point, relatively far off in the future, would the back-end system be updated to reflect the activity. If the activity caused a problem, that problem would not be known for several hours, or even a day or more, after the problematic activity or activities.

The classic example (and one of the main drivers for the emergence of the OLTP pattern) was the ATM. Prior to the arrival of ATMs, a customer would go to the bank counter to withdraw or deposit money. Back-end systems, either paper or computer, would be updated at the end of the day. This was slow, inefficient, error prone, and did not allow for a real-time understanding of the current state. For instance, a customer might withdraw more money than they had in their account.

With the arrival of ATMs, around 1970, a customer could self-serve cash withdrawals, deposits, and other transactions. The customer moved from nine-to-five access to 24/7 access. ATMs also allowed a customer to understand in real time what the state of their account was. With these new features, the requirements for the back-end systems became a lot more complex: specifically, fast data lookups, transactionality, availability, reliability, and scalability – the latter becoming more and more important as customers demanded access to their information and money from any point on the globe.

The data access pattern for OLTP is to retrieve a small set of data, usually by doing a lookup on an ID. For example, the account information for a given customer ID. The system also must be able to write back a small amount of information based on the given ID. So the system needs the ability to do fast lookups, fast point inserts, and fast updates or deletes.

Transaction support is arguably the most important characteristic that OLTP offers, as reflected in the name itself. A database transaction means a set of actions that are either all completed, or none of them are completed; there is no middle ground. For example, an ATM has to guarantee that it either gave the customer the money and debited their account, or did not give the customer money and did not debit their account. Only giving the customer money, but not debiting their account, harms the bank; only debiting the account, but not giving the customer money, harms the customer.

Note that doing neither of the actions – not giving the money, and not debiting the account – is an unhappy customer experience, but still leaves the system in a consistent state. This is why the notion of a database transaction is so powerful. It guarantees the atomicity of a set of actions, where atomicity means that related actions happen, or don’t happen, as a unit.
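
As a minimal illustration of that atomicity, the sketch below uses SQLite from Python's standard library (a toy stand-in, not a bank's actual OLTP system). The debit and the record of the cash dispense are committed together or rolled back together; the table and account names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE withdrawals (account_id TEXT, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('acct-1', 100)")
conn.commit()

def withdraw(account_id: str, amount: int) -> bool:
    """Debit the account and record the cash dispense as one atomic unit."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE id = ? AND balance >= ?",
                (amount, account_id, amount))
            if cur.rowcount == 0:
                raise ValueError("insufficient funds")
            conn.execute("INSERT INTO withdrawals VALUES (?, ?)",
                         (account_id, amount))
        return True   # money dispensed and account debited
    except ValueError:
        return False  # neither action happened; state stays consistent

print(withdraw("acct-1", 60))   # True
print(withdraw("acct-1", 60))   # False: would overdraw, so nothing is recorded
```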

Reliability is another key requirement. ATMs need to be always available, so customers can use one at any time. Uptime for the ATM is critical, even overcoming hardware or software failures, without human intervention. The system needs to be reliable because the interface is with the end customer and banks win on how well they deliver a positive customer experience. If the ATM fails every few times a customer tries it, the customer will get annoyed and switch to another bank.

Scalability is also a key requirement. Banks have millions of customers, and they will have tens of thousands of people hitting the back-end system at any given time. But the usage is not uniform. There are peak times when a lot more people hit the system.

For example, Friday is a common payday for companies. That means many customers will all be using the system around the same time to check on the balance and withdraw money. They will be seriously inconvenienced – and very unimpressed – if one, or some, or all of the ATMs go down at that point.

So banks need to scale to hundreds of thousands of users hitting the system concurrently on Friday afternoons. Hard to predict, one-off events, such as a hurricane or an earthquake, are among other examples that can also cause peaks. The worst case is often the one you didn’t see coming, so you need a very high level of resiliency even without having planned for the specific event that ends up occurring.

These requirements for the OLTP workload show up in many other use cases, such as retail transactions, billing, enterprise resource planning (widely known as ERP), customer relationship management (CRM), and just about any application where an end user is reviewing and manipulating data they have access to and where they expect to see the results of those changes immediately.

The existing legacy database systems were founded to solve these use cases over the last few decades, and they do a very good job of it, for the most part. The market for OLTP-oriented database software is in the tens of billions of dollars a year. However, with the rise of the Internet, and more and more transactional systems being built for orders of magnitude more people, legacy database systems have fallen behind in scaling to the level needed by modern applications.

The lack of scale out also makes it difficult for OLTP databases to handle analytical queries while successfully, reliably, and quickly running transactions. In addition, they lack the key technologies to perform the analytical queries efficiently. This has contributed to the need for separate, analytics-oriented databases, as described in the next section.

A key limitation is that OLTP databases have typically run on a single computing node. This means that the transactions that are the core of an OLTP database can only happen at the speed and volume dictated by the single system at the center of operations. In an IT world that is increasingly about scaling out – spreading operations across arbitrary numbers of servers – this has proven to be a very serious flaw indeed.

OLAP Emerges to Complement OLTP

After the evolution of OLTP, the other major pattern that has emerged is OLAP. OLAP emerged a bit after OLTP, as enterprises realized they needed fast and flexible access to the data stored in their OLTP systems.

OLTP system owners could, of course, directly query the OLTP system itself. However, OLTP systems were busy with transactions – any analytics use beyond the occasional query threatened to bog the OLTP systems down, limited to a single node as they were. And the OLAP queries quickly became important enough to have their own performance demands.

Analytics use would tax the resources of the OLTP system. Since the availability and reliability of the OLTP system were so important, it wasn’t safe to have just anyone running queries that might use up resources to any extent which would jeopardize the availability and reliability of the OLTP system.

In addition, people found that the kinds of analytics they wanted to do worked better with a different schema for the data than was optimal for the OLTP system. So they started copying the data over into another system, often called a data warehouse or a data mart. As part of the copying process, they would change the database schema to be optimal for the analytics queries they needed to do.

At first, OLTP databases worked reasonably well for analytics needs (as long as they ran analytics on a different server than the main OLTP workload). The legacy OLTP vendors included features such as grouping and aggregation in the SQL language to enable more complex analytics. However, the requirements of the analytics systems were different enough that a new breed of technology emerged that could satisfy analytics needs better, with features such as column-storage and read-only scale-out. Thus, the modern data warehouse was born.

The requirements for a data warehouse were the ability to run complex queries very fast; the ability to scale to handle large data sets (orders of magnitude larger than the original data from the OLTP system); and the ability to ingest large amounts of data in batches, from OLTP systems and other sources.

Query Patterns

Unlike the OLTP data access patterns that were relatively simple, the query patterns for analytics are a lot more complicated. Trying to answer a question such as, “Show me the sales of product X, grouped by region and sales team, over the last two quarters,” requires a query that uses more complex functions and joins between multiple data sets.

These kinds of operations tend to work on aggregates of data records, grouping them across a large amount of data. Even though the result might be a small amount of data, the query has to scan a large amount of data to get to it.
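
Continuing the toy SQLite sketch from the OLTP discussion, a query of this shape scans and aggregates many rows to produce a small result. This is a single-node illustration of the query pattern only; real data warehouses run such queries distributed, over columnar storage, and the schema and data here are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales
                (product TEXT, region TEXT, team TEXT, quarter TEXT, amount INTEGER)""")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    ("X", "EMEA", "Enterprise", "2018-Q4", 120),
    ("X", "EMEA", "SMB",        "2019-Q1", 80),
    ("X", "AMER", "Enterprise", "2019-Q1", 200),
    ("Y", "AMER", "Enterprise", "2019-Q1", 50),   # filtered out: different product
])

# "Sales of product X, grouped by region and sales team, over the last two quarters"
rows = conn.execute("""
    SELECT region, team, SUM(amount) AS total
    FROM sales
    WHERE product = 'X' AND quarter IN ('2018-Q4', '2019-Q1')
    GROUP BY region, team
    ORDER BY total DESC
""").fetchall()

for region, team, total in rows:
    print(region, team, total)
```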

Picking the right query plan to optimally fetch the data from disk requires a query optimizer. Query optimization has evolved into a specialty niche within the realm of computer science; there are only a small number of people in the world with deep expertise in it. This specialization is key to the performance of database queries, especially in the face of large data sets.

Building a really good query optimizer and query execution system in a distributed database system is hard. It requires a number of sophisticated components including statistics, cardinality estimation, plan space search, the right storage structures, fast query execution operators, intelligent shuffle, both broadcast and point-to-point data transmission, and more. Each of these components can take months or years of skilled developer effort to create, and more months and years to fine-tune.

Scaling

Datasets for data warehouses can get quite big. This is because you are not just storing a copy of the current transactional data, but taking a snapshot of the state periodically and storing each snapshot going back in time.

Businesses often have a requirement to go back months, or even years, to understand how the business was doing previously and to look for trends. So while operational data sets range from a few gigabytes (GBs) to a few terabytes (TBs), a data warehouse ranges from hundreds of GBs to hundreds of TBs. For the raw data in the biggest systems, data sets can reach petabytes (PBs).

For example, imagine a bank that is storing the transactions for every customer account. The operational system just has to store the current balance for the account. But the analytics system needs to record every transaction in that account, going back for years.
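As a rough sketch of the difference (the table and column names here are hypothetical), the two sides might model the same account like this:

-- Operational (OLTP) side: only the current state is kept.
CREATE TABLE account_balance (
    account_id  BIGINT PRIMARY KEY,
    balance     DECIMAL(18,2) NOT NULL
);

-- Analytics side: every transaction is kept, going back years.
CREATE TABLE account_transactions (
    account_id  BIGINT        NOT NULL,
    txn_time    DATETIME      NOT NULL,
    amount      DECIMAL(18,2) NOT NULL,
    txn_type    VARCHAR(20)   NOT NULL
);

The second table grows without bound, which is exactly why warehouse data sets end up orders of magnitude larger than their operational sources.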

As the systems grew into the multiple TBs, and into the PB range, it was a struggle to get enough computing and storage power into a single box to handle the load required. As a result, a modern data warehouse needs to be able to scale out to store and manage the data.

Scaling out a data warehouse is easier than scaling an OLTP system. This is because scaling queries is easier than scaling changes – inserts, updates, and deletes. You don’t need as much sophistication in your distributed transaction manager to maintain consistency. But the query processing needs to be aware of the fact that data is distributed over many machines, and it needs to have access to specific information about how the data is stored. Because building a distributed query processor is not easy, there have been only a few companies who have succeeded at doing this well.

Getting the Data In

Another big difference is how data is put into a data warehouse. In an OLTP system, data is entered by a user through interaction with the application. With a data warehouse, by contrast, data comes from other systems programmatically. Often, it arrives in batches and at off-peak times. The timing is chosen so that the work of sending data does not interfere with the availability of the OLTP system where the data is coming from.

Because the data is moved programmatically by data engineers, you don’t need the database platform to enforce constraints on the data to keep it consistent. Because it comes in batches, you want an API that can load large amounts of data quickly. (Many data warehouses have specialized APIs for this purpose.)
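As one illustration of such an API (the file path and table name are hypothetical, and other warehouses have their own equivalents, such as COPY commands or bulk-insert endpoints), MySQL-family systems can ingest a whole batch file in a single statement:

LOAD DATA LOCAL INFILE '/staging/orders_batch.csv'
INTO TABLE warehouse_orders
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;   -- skip the header row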

Lastly, the data warehouse is not typically available for queries during data loading. Historically, this process worked well for most businesses. For example, in a bank, customers would carry out transactions against the OLTP system, and the results could be batched and periodically pushed into the analytics system. Since statements were only sent out once a month, it didn’t matter if it took a couple of days before the data made it over to the analytics system.

So the result is a data warehouse that is queryable by a small number of data analysts. The analysts run a small number of complex queries during the day, and the system is offline for queries while loading data during the night. The availability and reliability requirements are lower than for an OLTP system, because it is not as big a deal if your analysts are offline for a while. You don’t need transactions of the type supported by the OLTP system, because data loading is controlled by your internal process.

The NoSQL Work Around

For more information on this topic, read our previous blog post: Thank You for Your Help, NoSQL, But We Got It from Here.

As the world “goes digital,” the amount of information available increases exponentially. In addition, the number of OLTP systems has increased dramatically, as has the number of users consuming them. The growth in data size, and in the number of people who want to take advantage of the data, has outstripped what legacy databases can manage. As scale-out patterns have permeated more and more areas within the application tier, developers have started looking for scale-out alternatives for their data infrastructure.

In addition, the separation of OLTP and OLAP has meant that a lot of time, energy, and money go into extracting, transforming, and loading data – widely known as the ETL process – between the OLTP and OLAP sides of the house.

ETL is a huge problem. Companies spend billions of dollars on people and technology to keep the data moving. In addition to the cost, the consequence of ETL is that users are guaranteed to be working on stale data, with the newest data up to a day old.

With the crazy growth in the amount of data – and in demand for different ways of looking at the data – the OLAP systems fall further and further behind. One of my favorite quotes, from a data engineer at a large tech company facing this problem, is: “We deliver yesterday’s insights, tomorrow!”

NoSQL came along promising an end to all this. NoSQL offered:

  • Scalability. NoSQL systems offered a scale-out model that broke through the limits of the legacy database systems.
  • No schema. NoSQL traded schemas for unstructured and semi-structured formats, dropping the rigid data typing and input checking that make database management challenging.
  • Big data support. Massive processing power for large data sets.

All of this, though, came at several costs:

  • No schema, no SQL. The lack of schema meant that SQL support was not only lacking from the get-go, but hard to achieve. Moreover, NoSQL application code is so intertwined with the organization of the data that application evolution becomes difficult. In other words, NoSQL systems lack the data independence found in SQL systems.
  • No transactions. It’s very hard to run traditional transactions on unstructured or semi-structured data. So data was left unreconciled but discoverable by applications, which then had to sort things out.
  • Slow analytics. Many of the NoSQL systems made it very easy to scale and to get data into the system (i.e., the data lake). While these systems made it possible to process larger amounts of data than ever before, they were pretty slow, with queries taking hours or even tens of hours. That was still better than not being able to ask the question at all, but it meant you had to wait a long while for the answer.

NoSQL was needed as a complement to OLTP and OLAP systems, to work around the lack of scaling. While it had great promise and solved some key problems, it did not live up to all its expectations.

The Emergence of Modern Databases

With the emergence of NewSQL systems such as MemSQL, much of the rationale for using NoSQL in production has dissipated. We have seen many of the NoSQL systems try to add back important, missing features – such as transaction support and SQL language support – but the underlying NoSQL databases are simply not architected to handle them well. NoSQL is most useful for niche use cases, such as a data lake for storing large amounts of data, or as a kind of data storage scratchpad for application data in a large web application.

The core problems still remain. How do you keep up with all the data flowing in and still make it available instantly to the people who need it? How can you reduce the cost of moving and transforming the data? How can you scale to meet the demands of all the users who want access to the data, while maintaining an interactive query response time?

These are the challenges giving rise to a new workload, operational analytics. Read our upcoming blog post to learn about the operational analytics workload, and how NewSQL systems like MemSQL allow you to handle the challenges of these modern workloads.

Session Management in Nodejs Using Redis as Session Store

Feed: Planet MySQL.
Author: Shahid shaikh.

We have covered session management in ExpressJS using the global variable technique, which of course will not work with a shared server or concurrent execution of HTTP requests, the most common production scenario.

Codeforgeek readers asked for a solution to this issue, and the best option is external session storage that does not depend on application requests. The answer is Redis, a lightweight and easy-to-use NoSQL database.

In this tutorial I am going to explain how to design and code session-oriented Express web applications using Redis as external session storage.


To get familiar with session handling in ExpressJS, I recommend reading our first article here.

Getting started with Redis :

If you have already installed Redis, please skip to the next section. For those of you who are not familiar with Redis, here is a little introduction.

Redis is a key-value cache and store. It is also referred to as a data structure server, because keys can contain strings, lists, hashes, sets, sorted sets, and more.

Redis is fast because it works on an in-memory data set: by default it stores data in memory rather than on disk, and CRUD operations against memory are far faster than against disk.

If you restart Redis or shut it down, you may lose all data unless you enable one of its persistence options to dump the data to disk. Be careful!
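If you do need durability, Redis provides RDB snapshots and an append-only file (AOF). As a minimal sketch, you can enable the AOF at runtime from the CLI (or set the equivalent directives in redis.conf):

redis-cli CONFIG SET appendonly yes    # log every write to an append-only file
redis-cli BGSAVE                       # or trigger an RDB snapshot in the background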

Installation:

1 : On mac

On Mac, if you have Homebrew installed, just open up your terminal and type

brew install redis

Make sure you have the command line tools installed, because it needs GCC to compile it.

If you don’t have Homebrew, please install it. It’s awesome!

2 : On ubuntu

Run the following command on Ubuntu and the rest will be done for you.

sudo apt-get install redis-server

3 : On Windows

Well, Redis does not officially support Windows. Hard luck.

Basic Redis commands:

I am only going to mention the commands needed for this tutorial. For detailed information, please visit the nice interactive demo built by the awesome Redis team to make you a pro in Redis.

1 : Starting Redis server.

Run this command on terminal.

redis-server &

2 : Open the Redis CLI tool.

Run this command on terminal.

redis-cli

3 : List all keys.

Run this command on terminal.

KEYS *

4 : Retrieve information regarding a particular key.

Run this command on terminal.

GET <key name>

Once you have installed Redis and run the first command, you should see something like this.
Redis start screen

Express session with Redis

To add Redis support, you have to use the redis client and the connect-redis package. Require express-session and pass it to connect-redis as a parameter; this initializes the Redis-backed session store.

Then, in the session middleware, pass the Redis store information such as host, port, and any other required parameters.

Here is sample Express code with Redis support. Have a look.

Express Session using Redis :
var express = require('express');
var redis   = require("redis");
var session = require('express-session');
var redisStore = require('connect-redis')(session);
var bodyParser = require('body-parser');
var client  = redis.createClient();
var app = express();

app.set('views', __dirname + '/views');
app.engine('html', require('ejs').renderFile);

app.use(session({
    secret: 'ssshhhhh',
    // create new redis store.
    store: new redisStore({ host: 'localhost', port: 6379, client: client, ttl: 260}),
    saveUninitialized: false,
    resave: false
}));
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({extended: true}));

app.get('/', function(req, res){
    // create new session object.
    if(req.session.key) {
        // if email key is set, redirect.
        res.redirect('/admin');
    } else {
        // else go to home page.
        res.render('index.html');
    }
});

app.post('/login', function(req, res){
    // when user logs in, set the key in redis.
    req.session.key = req.body.email;
    res.end('done');
});

app.get('/logout', function(req, res){
    req.session.destroy(function(err){
        if(err){
            console.log(err);
        } else {
            res.redirect('/');
        }
    });
});

app.listen(3000, function(){
    console.log("App Started on PORT 3000");
});

Notice the code where we initialize the session store. Have a look.

app.use(session({
    secret: 'ssshhhhh',
    // create new redis store.
    store: new redisStore({ host: 'localhost', port: 6379, client: client}),
    saveUninitialized: false,
    resave: false
}));

If the Redis server is running locally, this default configuration will work. Once you have configured it, store your session key the same way we did in the previous example.

req.session.key_name = value to set
// this will be stored in Redis; the value may contain the user ID, email, or any information which you need across your application.

Fetch the information from the Redis session key like this:

req.session.key["keyname"]

Our project:

To demonstrate this, I have developed a web application that lets you register, log in, and post a status. It’s simple, but it demonstrates how to handle sessions using external storage.

Create a project folder and copy this code into package.json.

Our package.json :

package.json
{
  "name": "Node-Session-Redis",
  "version": "0.0.1",
  "scripts": {
    "start": "node ./bin"
  },
  "dependencies": {
    "async": "^1.2.1",
    "body-parser": "^1.13.0",
    "connect-redis": "^2.3.0",
    "cookie-parser": "^1.3.5",
    "ejs": "^2.3.1",
    "express": "^4.12.4",
    "express-session": "^1.11.3",
    "mysql": "^2.7.0",
    "redis": "^0.12.1"
  }
}

Install the dependencies by running the following command.

npm install

Our database:

Once that’s complete, let’s design a simple database to support our application. Here is the diagram.
Database diagram

The database is simple and straightforward; create the database in MySQL and run the following DDL queries.

SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='TRADITIONAL,ALLOW_INVALID_DATES';

CREATE SCHEMA IF NOT EXISTS `redis_demo` DEFAULT CHARACTER SET latin1 ;
USE `redis_demo` ;

-- -----------------------------------------------------
-- Table `redis_demo`.`user_login`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `redis_demo`.`user_login` (
  `user_id` INT(11) NOT NULL AUTO_INCREMENT,
  `user_email` VARCHAR(50) NOT NULL,
  `user_password` VARCHAR(50) NOT NULL,
  `user_name` VARCHAR(50) NOT NULL,
  PRIMARY KEY (`user_id`),
  UNIQUE INDEX `user_email` (`user_email` ASC))
ENGINE = InnoDB
AUTO_INCREMENT = 7
DEFAULT CHARACTER SET = latin1;

-- -----------------------------------------------------
-- Table `redis_demo`.`user_status`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `redis_demo`.`user_status` (
  `user_id` INT(11) NOT NULL,
  `user_status` TEXT NOT NULL,
  `created_date` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  INDEX `user_id` (`user_id` ASC),
  CONSTRAINT `user_status_ibfk_1`
    FOREIGN KEY (`user_id`)
    REFERENCES `redis_demo`.`user_login` (`user_id`))
ENGINE = InnoDB
DEFAULT CHARACTER SET = latin1;

SET SQL_MODE=@OLD_SQL_MODE;
SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS;
SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS;

Our Server code

The server file contains the application routes, database support, and Redis session support. We first connect to the database and initialize Redis; then each route performs its particular action.

/bin/index.js
/**
  Loading all dependencies.
**/

var express         =     require("express");
var redis           =     require("redis");
var mysql           =     require("mysql");
var session         =     require('express-session');
var redisStore      =     require('connect-redis')(session);
var bodyParser      =     require('body-parser');
var cookieParser    =     require('cookie-parser');
var path            =     require("path");
var async           =     require("async");
var client          =     redis.createClient();
var app             =     express();
var router          =     express.Router();

// Always use MySQL pooling.
// Helpful for multiple connections.

var pool    =   mysql.createPool({
    connectionLimit : 100,
    host     : 'localhost',
    user     : 'root',
    password : '',
    database : 'redis_demo',
    debug    : false
});

app.set('views', path.join(__dirname, '../', 'views'));
app.engine('html', require('ejs').renderFile);

// IMPORTANT
// Here we tell Express to use Redis as session store.
// We pass Redis credentials and port information.
// And Express does the rest!

app.use(session({
    secret: 'ssshhhhh',
    store: new redisStore({ host: 'localhost', port: 6379, client: client, ttl: 260}),
    saveUninitialized: false,
    resave: false
}));
app.use(cookieParser("secretSign#143_!223"));
app.use(bodyParser.urlencoded({extended: false}));
app.use(bodyParser.json());

// This is an important function.
// This function does the database handling task.
// We also use async here for control flow.

function handle_database(req, type, callback) {
    async.waterfall([
        function(callback) {
            pool.getConnection(function(err, connection){
                if(err) {
                    // if there is an error, stop right away.
                    // This stops the async code execution and goes to the last function.
                    callback(true);
                } else {
                    callback(null, connection);
                }
            });
        },
        function(connection, callback) {
            var SQLquery;
            switch(type) {
                case "login" :
                    SQLquery = "SELECT * from user_login WHERE user_email='" + req.body.user_email + "' AND `user_password`='" + req.body.user_password + "'";
                    break;
                case "checkEmail" :
                    SQLquery = "SELECT * from user_login WHERE user_email='" + req.body.user_email + "'";
                    break;
                case "register" :
                    SQLquery = "INSERT into user_login(user_email,user_password,user_name) VALUES ('" + req.body.user_email + "','" + req.body.user_password + "','" + req.body.user_name + "')";
                    break;
                case "addStatus" :
                    SQLquery = "INSERT into user_status(user_id,user_status) VALUES (" + req.session.key["user_id"] + ",'" + req.body.status + "')";
                    break;
                case "getStatus" :
                    SQLquery = "SELECT * FROM user_status WHERE user_id=" + req.session.key["user_id"];
                    break;
                default :
                    break;
            }
            callback(null, connection, SQLquery);
        },
        function(connection, SQLquery, callback) {
            connection.query(SQLquery, function(err, rows){
                connection.release();
                if(!err) {
                    if(type === "login") {
                        callback(rows.length === 0 ? false : rows[0]);
                    } else if(type === "getStatus") {
                        callback(rows.length === 0 ? false : rows);
                    } else if(type === "checkEmail") {
                        callback(rows.length === 0 ? false : true);
                    } else {
                        callback(false);
                    }
                } else {
                    // if there is an error, stop right away.
                    // This stops the async code execution and goes to the last function.
                    callback(true);
                }
            });
        }],
        function(result){
            // This function gets called after every async task has finished.
            if(typeof(result) === "boolean" && result === true) {
                callback(null);
            } else {
                callback(result);
            }
        });
}

/**
    — Router Code begins here.
**/

router.get('/', function(req, res){
    res.render('index.html');
});

router.post('/login', function(req, res){
    handle_database(req, "login", function(response){
        if(response === null) {
            res.json({"error" : "true", "message" : "Database error occurred"});
        } else {
            if(!response) {
                res.json({
                    "error" : "true",
                    "message" : "Login failed ! Please register"
                });
            } else {
                req.session.key = response;
                res.json({"error" : false, "message" : "Login success."});
            }
        }
    });
});

router.get('/home', function(req, res){
    if(req.session.key) {
        res.render("home.html", { email : req.session.key["user_name"]});
    } else {
        res.redirect("/");
    }
});

router.get("/fetchStatus", function(req, res){
  if(req.session.key) {
    handle_database(req, "getStatus", function(response){
      if(!response) {
        res.json({"error" : false, "message" : "There is no status to show."});
      } else {
        res.json({"error" : false, "message" : response});
      }
    });
  } else {
    res.json({"error" : true, "message" : "Please login first."});
  }
});

router.post("/addStatus", function(req, res){
    if(req.session.key) {
      handle_database(req, "addStatus", function(response){
        if(!response) {
          res.json({"error" : false, "message" : "Status is added."});
        } else {
          res.json({"error" : false, "message" : "Error while adding Status"});
        }
      });
    } else {
      res.json({"error" : true, "message" : "Please login first."});
    }
});

router.post("/register", function(req, res){
    handle_database(req, "checkEmail", function(response){
      if(response === null) {
        res.json({"error" : true, "message" : "This email is already present"});
      } else {
        handle_database(req, "register", function(response){
          if(response === null) {
            res.json({"error" : true, "message" : "Error while adding user."});
          } else {
            res.json({"error" : false, "message" : "Registered successfully."});
          }
        });
      }
    });
});

router.get('/logout', function(req, res){
    if(req.session.key) {
        req.session.destroy(function(){
            res.redirect('/');
        });
    } else {
        res.redirect('/');
    }
});

app.use('/', router);

app.listen(3000, function(){
    console.log("I am running at 3000");
});

Explanation:

When the user provides login credentials, we check them against our database. If the check succeeds, we store the database response under the session key in Redis. This is where the session starts.

As soon as the user goes to the home page, we validate the session key and, if it is present, retrieve the user ID from it to run further MySQL queries.

When the user clicks Logout, we call the req.session.destroy() function, which in turn deletes the key from Redis and ends the session.
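If you want to watch this happen, you can inspect the session keys directly in Redis while logging in and out. connect-redis stores each session under a prefixed key (the default prefix is typically sess:, though it can vary by version), so something like the following should show the session appear on login and disappear on logout:

redis-cli KEYS 'sess:*'            # list session keys created by connect-redis
redis-cli GET 'sess:<session id>'  # inspect the stored session data (JSON)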

Views (index.html and home.html)

Here is our home page code.

/view/index.html
<html>
<head>
<title>Home</title>

<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">

<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap-theme.min.css">
<script src="https://code.jquery.com/jquery-1.11.3.min.js"></script>

<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js"></script>
<script>
$(document).ready(function(){
    $("#username").hide();
    $('#login-submit').click(function(e){
      if($(this).attr('value') === 'Register') {
        $.post("http://localhost:3000/register",{
               user_name : $("#username").val(),
               user_email : $("#useremail").val(),
               user_password : $("#password").val()
             },function(data){
            if(data.error) {
                alert(data.message);
            } else {
                $("#username").hide();
                $("#login-submit").prop('value','Log in');
            }
        });
    } else {
        $.post("http://localhost:3000/login",{
                   user_email : $("#useremail").val(),
                   user_password : $("#password").val()
                   },function(data){
            if(!data.error) {
                window.location.href = "/home";
            } else {
                alert(data.message);
            }
        });
    }
    });
    $("#reg").click(function(event){
        $("#username").show('slow');
        $("#login-submit").prop('value','Register');
        event.preventDefault();
    });
});
</script>
    </head>
    <body>
    <nav class="navbar navbar-default navbar-fixed-top">
    <div class="navbar-header">
    <a class="navbar-brand" href="#">
        <p>Redis session demo</p>
    </a>
    </div>
  <div class="container">
    <p class="navbar-text navbar-right">Please sign in</p>
  </div>
</nav>
<div class="form-group" style="margin-top: 100px; width : 400px; margin-left : 50px;">
    <input type="text" id="username" placeholder="Name" class="form-control"><br>
    <input type="text" id="useremail" placeholder="Email" class="form-control"><br>
    <input type="password" id="password" placeholder="Password" class="form-control"><br>
    <input type="button" id="login-submit" value="Log In" class="btn btn-primary"> <a href="" id="reg">Sign up here</a>
    </div>
    </body>
</html>

Here is how it looks.

Home page - Redis session

/view/home.html
<html>
<head>
<title>Home</title>

<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">

<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap-theme.min.css">
<script src="https://code.jquery.com/jquery-1.11.3.min.js"></script>

<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js"></script>
<script type="text/javascript">
    $(document).ready(function(){
        $.get("http://localhost:3000/fetchStatus",function(res){
            $.each(res.message,function(index,value) {
                // the original markup for each status entry was lost in extraction;
                // a simple paragraph is used here as a stand-in.
                $("#showStatus").append('<p>You have posted: ' + value.user_status + '</p>');
            });
        });
        $("#addNewStatus").click(function(e){
            e.preventDefault();
            if($("#statusbox").val() !== "") {
                $.post("/addStatus",
                       { status : $("#statusbox").val() },
                       function(res){
                    if(!res.error) {
                        alert(res.message);
                    }
                })
            }
        });
});
</script>
</head>
<body>

Here is how it looks.

Home page - redis session

How to run:

Download the code from GitHub and extract the files. Make sure you have installed Redis and created the database.

If your database name or MySQL password is different, please update it in the bin/index.js file.

Once done, type npm start in the terminal to start the project.

Start the script

Now open your browser and go to localhost:3000 to view the app. Register with your email ID and password, then log in with the same credentials. After logging in, have a look at Redis using the commands mentioned above.

Redis key set

Now log out of the system and check for the same key. It should no longer be there.

After logout

That’s it. Express sessions backed by Redis are working.

Conclusion:

Redis is one of the most popular key-value storage systems. Using it to store sessions is not only beneficial in a production environment, it also helps improve the performance of the system.
