The following information is meant to help you start conversations about your internal storage design and needs for use with Atlassian Data Center tools. We've also provided some troubleshooting tips to assist you as needed.
Storage is not a simple subject; building a successful design requires thought and careful implementation. Engineering teams have been working on shared, clustered, and highly available storage for decades. Our intent is to share some of our knowledge and practices for using or deploying storage services and solutions in your environment.
Please review the appropriate hardware requirements pages to determine the needs for your Atlassian product(s).
Internet Small Computer Systems Interface (iSCSI, pronounced eye-skuh-zee) is an IP-based storage networking standard for linking data storage facilities. It provides block-level access to storage devices by carrying SCSI commands over a TCP/IP network. iSCSI is used to facilitate data transfers over intranets and to manage storage over long distances. It can be used to transmit data over local area networks (LANs), wide area networks (WANs), or the Internet and can enable location-independent data storage and retrieval.
The term SAN stands for storage area networks. The SAN protocol allows organizations to consolidate storage into storage arrays while providing clients (such as databases and web servers) with the appearance of having locally attached SCSI disks. It mainly competes with Fibre Channel, but unlike traditional Fibre Channel which usually requires dedicated cabling, iSCSI can be run over longer distances using existing network infrastructure. iSCSI was pioneered by IBM and Cisco in 1998 and submitted as a draft standard in March 2000.
An iSCSI SAN, therefore, is simply a SAN carried over TCP/IP. You can see an example of a typical iSCSI SAN in this diagram.
In a data center, you would never place storage traffic on the same network, switch, or even fabric as the standard Ethernet traffic used by applications and users. A typical design includes dedicated infrastructure for the SAN. As shown in the linked diagram above, standard networking operates on Fabric A, which may include clients and applications, while Fabrics B and C are dedicated to storage.
For this reason, we build specialized networks just for storage I/O, including:
Dedicated PCI-E lanes for network adaptors
Specialized ASICs for the storage traffic type
Low latency
Packet and frame throughput
Jumbo frames (more I/O in a single frame!)
Dedicated network adaptors for the traffic type
Dedicated network switches for traffic types
Fibre Channel
InfiniBand
iSCSI
Backplane switching over several ASICs
Fabric networks without routing (No overhead, less latency, no switch or router traversal)
No VLANs
This design is very common and is also used in Network File System (NFS) networking for virtualization and in high-performance clustering to support thousands of virtual machines.
This is only a brief overview of the subject, but it shows that a lot of thought is put into designing and delivering high-speed storage that offers high-throughput traversal of data in low-latency networks. It requires considerable effort to deliver effective storage and access solutions that will result in optimal performance for your applications.
So why do we need to understand this better?
In the application space, you will need to provide shared network storage for Data Center products. The problems you may face are often directly related to storage configuration and deployment. Every millisecond of less-than-optimal response time adds to how long the application takes to respond, which can negatively affect end users' perception of how fast the application is.
NFS can be easily deployed on Windows and Linux, shared out from NAS devices, or even served directly from your SAN devices. These solutions, while less costly, tend to work in the short term but show signs of sluggishness as you scale to true enterprise levels. To help you provide better solutions for your environment, we have spent time working on guidance to assist in your configuration.
Solutions & guidance considerations
Not every solution provided herein will be possible for everyone, and we are aware of these limitations. However, working on even some of these changes could provide as much as a 15% performance improvement in storage. Considering that shared storage is used for indexing, attachments, and other aspects of Data Center products, this can significantly improve end-user experience as well.
Build or attach to dedicated storage networks
Avoid having application or “LAN” traffic on your storage network
Use storage-optimized NICs and switches in the separate SAN
Use dedicated NICs on this network with iSCSI or storage optimizations
Use virtual adaptors connected to a storage vSwitch
Perform hardware offloading to the physical NIC
Full duplex settings on switch ports and NICs without auto-negotiation
“Jumbo frames” on the entire network
Every point in a storage network must use the same or larger frame size (MTU 9000+)
If the MTU for jumbo frames on this network is not correct, you may incur data corruption and see no improvements
Avoid IP fragmentation
More I/O can be sent in a single frame
This includes physical NICs on the server, switch ports, virtual switches, and virtual adaptors
Eliminate VLANs and routing
VLANs add Ethernet overhead; it's best to avoid them
As soon as you introduce routing, you incur a performance penalty for every packet
This includes crossing subnets, availability zones, etc.
Use multiple network paths and load balance
Load balance at the switch and NIC
Optimize settings
Use FAST handshaking
Use TCP recycling
Ensure NIC frame buffers are optimized on NICs/virtual adaptors and offloaded to hardware (TSO)
Not possible in cloud deployments
Use high-performance vNICs such as VMXNET3
Match the NFS rsize/wsize to the provisioned LUN/volume/RAID stripe or block size
Defaults may be too big or too small; adjust these to suit the desired workload
Provision stripe sizes for storage
Enable or disable RX/TX flow control pending your device/NFS server recommendations
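Two of the recommendations above can be sketched from a Linux client. The hostname `nfs-server`, the export path, and the rsize/wsize values below are illustrative placeholders, not prescriptions:

```shell
# Verify jumbo frames survive the whole path before trusting them:
# 8972 bytes = 9000-byte MTU minus 28 bytes of IP + ICMP headers,
# and -M do forbids fragmentation, so an undersized hop fails loudly.
ping -M do -s 8972 -c 3 nfs-server

# Mount with explicit rsize/wsize rather than trusting the defaults
# (match these to your provisioned stripe/block sizes):
mount -t nfs -o vers=4.1,hard,noatime,rsize=1048576,wsize=1048576 \
    nfs-server:/export/shared /mnt/shared

# Confirm what the server actually negotiated:
grep /mnt/shared /proc/mounts
```

If the ping fails with a "message too long" error, some device in the path is not passing jumbo frames and should be fixed before enabling them anywhere else.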
A simple design for this on vSphere with dedicated virtual switches, network interfaces, larger frames, and reduced latency is shown below:
Key benefits:
Dedicated networks for application traffic (users/client side)
Dedicated networks for NFS/storage/clustering (node to node or server to storage)
Reduced latency
Increased bandwidth
Better availability
Better scalability
Roll your own custom NFS
Linux-based NFS servers are powerful, scalable, and simple to deploy. Using the above guidelines, you can optimize your deployment for different workloads and requirements. Each Atlassian application has different workload and I/O requirements, so it's important to understand the I/O workloads and work with your infrastructure teams to get the most out of your environment.
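As a sketch of how simple the server side can be, a minimal export might look like the following. The path and subnet are hypothetical, and a production export would need a proper security and sync-option review:

```shell
# Create the directory to share and export it to a client subnet.
mkdir -p /export/shared
echo '/export/shared 10.0.20.0/24(rw,sync,no_subtree_check)' >> /etc/exports

exportfs -ra   # re-export everything listed in /etc/exports
exportfs -v    # verify the active exports and their options
```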
| Application | IO profile | Disk recommendation |
| --- | --- | --- |
| Bitbucket | Very small (<1 MB), random IO | NVMe, SSD, hybrid |
| Bamboo | Very small (<1 MB), random IO + sequential IO (artifacts) | NVMe, SSD, hybrid |
| Jira | Small/medium (1-4 MB), random IO | NVMe, SSD, hybrid |
| Confluence | Medium/large, random IO + sequential IO | NVMe, SSD, hybrid, RAID HDD |
This is a high-level example. Use cases and workloads can vary by users, projects, use of applications, business model, and industry.
Linux environments
Using an enterprise, production-grade Linux is highly recommended, as this provides better support overall. Typically, NFS servers perform network/IO passthrough to disk. This is mostly a kernel/CPU-intensive operation and requires only a minimal memory footprint.
A Linux NFS server could be made from the following specs:
| Profile | CPU | Memory | Notes |
| --- | --- | --- | --- |
| Small | 4 vCPU | 4 GB+ | Small instances; single node |
| Medium | 8 vCPU | 8 GB+ | Medium-sized instances; 2-3 application nodes |
| Large | 16 vCPU+ | 32-48 GB+ | Large, high-volume instances; several applications/nodes |
By default, Linux consumes as much free memory as possible for filesystem and object caching, so adding memory can increase these caches.
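You can observe this caching behavior with `free`; the `buff/cache` column reports memory currently used for page and filesystem caches, which the kernel reclaims automatically when applications need it:

```shell
# Show memory usage in human-readable units; "buff/cache" is the
# filesystem/object cache described above.
free -h
```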
Test different profiles and workloads to see which profile suits which application and workload for your tools within your own environment.
A note on Git workloads
Git can be a very resource-intensive tool. It relies heavily on CPU cores, CPU frequency, and system memory, but the greatest factor is disk I/O. A typical workload on a NAS might be a team storing videos, photos, large documents, or backups. This is mostly "sequential IO", and most hard drives handle it well: storage blocks get lined up and written in a single "spin" of an HDD.
For example: copying a movie to a hard drive might see transfer speeds of 100 MB/s or more in disk I/O.
Git, however, behaves the opposite way. It typically handles many small files, which is harder for a traditional disk or RAID system to keep pace with. Benchmarking Git performance can be very difficult and can stress most storage systems due to the random-access nature and small size of the file transfers.
For example: copying 100,000 very small files might see transfer speeds of only about 1 MB/s in disk I/O.
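A rough way to feel this difference yourself is to time one large sequential write against many tiny writes. `TARGET` below is a placeholder directory (point it at the storage under test), and the sizes are deliberately small so the sketch runs quickly:

```shell
# Directory under test; defaults to /tmp so the sketch runs anywhere.
TARGET=${TARGET:-/tmp/io-sketch}
mkdir -p "$TARGET"

# Sequential I/O: one large file written in 1 MiB blocks (HDD-friendly).
dd if=/dev/zero of="$TARGET/seq.dat" bs=1M count=64 2>/dev/null

# Git-like random/small-file I/O: many tiny files, which stresses
# metadata operations and IOPS rather than raw bandwidth.
for i in $(seq 1 200); do
    dd if=/dev/zero of="$TARGET/small_$i" bs=4k count=1 2>/dev/null
done
```

Wrapping each phase in `time` on real storage will typically show the small-file phase running far below the sequential throughput, mirroring the numbers above.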
Therefore, it's better to work with disks that can access data quickly for random I/O. Solid state drives (SSDs) typically excel over traditional hard disk drives (HDDs) for these workloads. The fastest drives, non-volatile memory express (NVMe, also known as the Non-Volatile Memory Host Controller Interface Specification, NVMHCIS), are the best recommendation in this area: they excel at parallelism and concurrency and typically offer much lower-latency access to the raw data on the drive. Most of today's computers ship with NVMe drives, and their access times and responsiveness are ideal. The downside is that NVMe can be much more expensive than either HDD or SATA SSD, but the cost is worth the performance where access speed matters.
Your infrastructure teams can assist in getting the right storage solution for the tools that access Git.
Troubleshooting tools and documentation
NFS allows for extensible and flexible network-based storage. However, you may run into performance issues that are challenging to solve. When troubleshooting NFS, be aware that several teams and components may need to be involved to identify and resolve issues. To assist you, the list below (not comprehensive) can help you start the conversation internally and organize your response teams should these issues arise.
| Team | Roles | Ownership | Tools |
| --- | --- | --- | --- |
| Network engineering | Network management; network security; network capacity; design & availability | Firewalls and routers; IPS & IDS; switching; OSI layers 1-4 (roughly); QoS & bandwidth | Wireshark; tcpdump; netstat; switch/router OS CLI and tools |
| Infrastructure/virtualization | Provide virtual machines; provide operating systems & templates; monitoring and alerting | Virtualization stacks (vSphere, Hyper-V, Xen; GCP, Azure, AWS); clustering and HA; compute capacity (CPU & memory); sometimes storage as well | Platform-specific tools; VMware monitoring; NMON; networking tools (tcpdump, Wireshark, etc.) |
| Storage & backups | Provide storage to infra teams; provide backup targets & solutions; manage data integrity, availability, and capacity | SAN (PureStorage, EMC, HPE 3PAR, IBM, NetApp); NAS (CIFS/NFS shares, file shares for teams); backups (Veeam, CommVault, Backup Exec) | nfsstat; nfsiostat; tcpdump & Wireshark; NMON; SAN & storage utilities |
Generally, pulling these teams into a problem or incident with NFS will get the right people in the room and in the conversation. This is a rough example; teams and ownership may vary in your organization.
NFS documentation
There are many guides to assist you with configuring, implementing, and troubleshooting NFS in your environment. Below are a few linked articles that are available to assist you as needed:
By default, the NFS service is started with 8 threads. This can be adjusted if you have a busy, highly concurrent NFS workload. Adjusting this value (doubling until diminishing returns) could provide additional throughput and better latency in Linux environments.
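On a Linux NFS server, the thread count can be inspected and adjusted as sketched below (16 is illustrative; double the value until returns diminish, and persist it in `/etc/nfs.conf` on modern RHEL-family systems):

```shell
# Show the current number of nfsd threads:
cat /proc/fs/nfsd/threads

# Raise it at runtime (takes effect immediately):
rpc.nfsd 16

# Persist the value across restarts in /etc/nfs.conf:
#   [nfsd]
#   threads=16
```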
Tuned is a RHEL/CentOS service that tweaks the kernel and system parameters for different workloads. The tuned service can be installed quickly and easily:
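A minimal install-and-activate sketch on a RHEL-family system follows; the chosen profile is only an example, so pick one that matches your workload:

```shell
yum install tuned -y
systemctl enable --now tuned   # start tuned and enable it at boot

tuned-adm active               # show the currently applied profile
tuned-adm profile throughput-performance   # e.g. a profile suited to storage servers
```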
RedHat, Oracle, and IBM provide several profiles for different workloads. Using this tool can help apply some of the best practices within Linux for a target workload. More profiles are found here: tuned/profiles at master · redhat-performance/tuned
You can even write your own profiles, building on an existing one with tweaks that could be recommended by Premier Support or development.
Troubleshooting tools
Handy tools for troubleshooting NFS performance
More details on using these are found below in the Red Hat documentation.
ethtool
ethtool is a Linux tool for querying and changing NIC settings at a very low level.
```shell
ethtool -S ethX   # Displays statistics and errors
ethtool -a ethX   # Displays how flow control is performed or enabled
ethtool -k ethX   # Displays how offloading is configured
```
netstat
netstat prints data about open connections and protocol stack stats, including device info, TCP socket info, and Unix sockets.
```shell
netstat -s   # Provides error counters
```
Dropwatch (Redhat)
Dropwatch is a tool used to determine where packets are dropped by the kernel. It's useful in performance cases to narrow down the problem area.
```shell
dropwatch -l kas
> Start
####
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
2 drops at ip_forward+1b7 (0xffffffffbd809777)
4 drops at ip_forward+1b7 (0xffffffffbd809777)
1 drops at sk_stream_kill_queues+48 (0xffffffffbd7a2158)
1 drops at __brk_limit+25100c6 (0xffffffffc0fe70c6)
1 drops at ip_forward+1b7 (0xffffffffbd809777)
2 drops at ip_forward+1b7 (0xffffffffbd809777)
3 drops at ip_forward+1b7 (0xffffffffbd809777)
```
nfsstat
nfsstat is a performance and statistics tracker for NFS/RPC calls. It's useful on the server to diagnose misbehaving NFS activity.
```shell
nfsstat -r -l   # Only RPC stats
nfsstat -s -l   # Only server stats
nfsstat -c -l   # Only client stats
```
nfsiostat
nfsiostat provides client-side mount statistics, including details like operations per second, kilobytes per second, retransmission counts, average round-trip time (RTT), and average execution time per operation.
tcpdump
tcpdump is useful for capturing network data between the NFS client and NFS server. It can also validate packet sizes, frame lengths, timings, and much more. Its output can be parsed by Wireshark:
```shell
tcpdump -vvv -n -i ethX host localhost and nfs-server.com
```
nmon
nmon is one of the oldest and most powerful performance analysis tools for Linux. Originally developed by IBM and Oracle, this is a very powerful tool with next to zero overhead on the host machine.
nmon has two modes: a graphical/TUI shell output mode and a background logging mode. Background logging mode collects performance stats that can be parsed later with Excel-based or Java-based analysis tools.
```shell
yum install epel-release -y && yum install nmon -y
```
To capture data with 1 second sample intervals:
```shell
nmon -f -t -s 1
```
This writes a .nmon file to the working directory, which can then be parsed with the following tool.
.nmon files or reports can be shared with your support teams easily.
Measuring success and changes
To understand the outcome of any changes made within your environment, you will need to measure data to determine what is effective. For that reason, we recommend making all such changes in a lower environment before promoting them to production. We also recommend thorough testing to tune and configure your network, client, and server environment based on your workloads.
Tips
Always perform benchmarks
Create tests and benchmarks based on your use case and workloads
Examples might include the transfer of lots of small Bitbucket repos, with many small files vs. large photos/documents in Confluence Data Center
Simulate concurrency and user counts
Measure changes between settings and configurations
Involve infrastructure teams early and outline the application workload and requirements
Be sure to document test results, as some tweaks may negatively impact performance
Conclusion
Every network is unique, and your workloads may dictate different solutions. It is important to understand that storage is a major component in Data Center deployments and that greater throughput with lower latency will result in a more performant application.