Stand Aside, Sysadmin

Manly yes, but I like it too.

Ubuntu 10.04.3 LTS grub-install failure on software RAID

Ubuntu 10.04.3 LTS has an apparent bug when running grub-install onto the MBR during the post-install phase: neither grub2 nor legacy grub will install. To get grub installed, you need to chroot into the installed environment (takes me back to my Gentoo days..), update, then retry installing legacy grub.
To summarise:

Press CTRL-ALT-F2 to switch to the second virtual console (tty2), then [Enter] to activate the console
Mount virtual filesystems:

mount --bind /proc /target/proc
mount --bind /dev /target/dev

Chroot into the environment, update and install grub:

chroot /target
bash
apt-get update
apt-get -y dist-upgrade
apt-get install grub
CTRL-D
CTRL-D

Now switch back to the ncurses installer on the first virtual console (CTRL-ALT-F1) and re-run the grub install step, making sure to select legacy grub.

Apparently a fix has made it into 10.04.5..

SVN::Client and Magical Error Codes

The Perl SVN::Client module has a tendency to produce some quite obscure error messages, such as this beauty that cropped up recently, generated when calling the status callback function.

perl: subversion/libsvn_subr/path.c:114: svn_path_join: Assertion `svn_path_is_canonical(base, pool)

A bit of digging with Data::Dumper revealed that this was caused by a path that was inadvertently postfixed with two sequential forward slashes – which the native C client doesn’t seem to mind at all..
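One way to guard against this is to normalise paths before they ever reach the bindings. The sketch below is in Python rather than Perl, purely to illustrate the idea. canonicalize_svn_path is a hypothetical helper, not part of SVN::Client, and it only approximates what Subversion considers canonical (no doubled or trailing slashes).

```python
import re


def canonicalize_svn_path(path: str) -> str:
    """Collapse runs of '/' and drop trailing slashes (hypothetical helper)."""
    # Keep a URL scheme such as "svn://" or "file://" out of the clean-up,
    # otherwise its double slash would be collapsed as well.
    match = re.match(r"^([A-Za-z][A-Za-z0-9+.-]*://)(.*)$", path)
    scheme, rest = (match.group(1), match.group(2)) if match else ("", path)

    # "a//b" -> "a/b"
    rest = re.sub(r"/{2,}", "/", rest)

    # "a/b/" -> "a/b", but leave a bare "/" alone.
    if len(rest) > 1:
        rest = rest.rstrip("/")
    return scheme + rest
```

Running every path through something like this before handing it to the status call would have avoided the assertion in my case.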

Nothing Lasts Forever

To be honest, I saw it coming months ago.
We’ve had a lot of good times together over the years,
But I guess nothing is forever.

This has been a long time coming and I just can’t delude myself any longer that things could ever go back to the way they were.
We should have just faced up to things and ended it sooner; it would have saved us both some tears.
I’m sorry to say it but I’ve met someone new.

She’s skinnier and younger, and some would say prettier,
But I can’t help thinking she reminds me of you.
Or at least what you once were.
Goodbye del.icio.us, It’s over.

New Puppet Module – tsmclient

I wrote a puppet module early last year for managing TSM client installation, but it was far from perfect and still required the manual step of connecting to the server for the first time to store the password.
This evening, a couple of new servers going into production prompted me to re-write the whole thing and make it better.

Our process for adding new nodes to backup now consists of 3 steps.

Add node to backup server:

reg node $hostname $password userid=none contact="$contact" emailaddress="$emailaddress"

Add node to schedule:

def assoc $policy_dom $schedule_name $hostname

Add the password to the node manifest under the $tsmpassword variable, and add the tsm::client class to the server.

Done.

Our infrastructure has multiple sites, each with their own TSM server, and those TSM servers listen on different ports.
To automate client configuration, we call on 4 facter modules: tppnetwork, tppsite, tsmserver, and tsmport.

[rclark@vm242|SITE2 ~]$ facter | egrep '^tpp|^tsm'
tppnetwork => Net210_LAN
tppsite => SITE2
tsmport => 1510
tsmserver => tsm-site2.internal.example.com

tppnetwork is a facter module that takes the IP address of a node and runs it through a bunch of regexes matching the various vlans and networks that we run (including Amazon address space). This module is used by lots of other things, not just tsmclient.

tppsite is a facter module that takes the output of tppnetwork, and from that determines which site that network belongs to. It is also used by many other things.

tsmserver takes the output of tppsite, and from that gives the hostname of the TSM server for that site.

tsmport takes the output of tsmserver, and gives the port number that that server is configured to listen on.

So, from these four facts we have the basic pieces of information that we need to configure our TSM client.
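The chain of lookups is easier to see in code. This is a rough Python sketch of the idea, not the facter modules themselves: it swaps the regexes for the standard ipaddress module, and every network range, the SITE1 entries and port 1500 are made-up placeholders. Only the SITE2 values match the facter output above.

```python
import ipaddress

# Placeholder data: the address ranges, SITE1 and port 1500 are assumptions.
NETWORKS = {
    "Net110_LAN": ipaddress.ip_network("10.1.10.0/24"),
    "Net210_LAN": ipaddress.ip_network("10.2.10.0/24"),
}
SITE_BY_NETWORK = {"Net110_LAN": "SITE1", "Net210_LAN": "SITE2"}
SERVER_BY_SITE = {
    "SITE1": "tsm-site1.internal.example.com",
    "SITE2": "tsm-site2.internal.example.com",
}
PORT_BY_SERVER = {
    "tsm-site1.internal.example.com": 1500,
    "tsm-site2.internal.example.com": 1510,
}


def resolve_tsm_facts(ip: str) -> dict:
    """Derive network -> site -> server -> port, each from the previous one."""
    address = ipaddress.ip_address(ip)
    network = next(
        (name for name, net in NETWORKS.items() if address in net), None
    )
    if network is None:
        raise LookupError(f"no known network contains {ip}")
    site = SITE_BY_NETWORK[network]
    server = SERVER_BY_SITE[site]
    return {
        "tppnetwork": network,
        "tppsite": site,
        "tsmserver": server,
        "tsmport": PORT_BY_SERVER[server],
    }
```

Because each fact only depends on the one before it, adding a new site means touching the lookup tables and nothing else.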

The storage of the password for the first time is done by first detecting the need for a password via an authentication failure – to get this to work without requiring input on STDIN is a bit of a hack – running

dsmc query session </dev/null | /bin/grep '^ANS1025E'

will force a ‘query sess’ to run in background mode, then we grep for a line beginning with an authentication failure error code. If this is detected, then running

dsmc set password $tsmpassword $tsmpassword

will be called, storing the password for the client. This has the added bonus of making mass node password changes easier in future – just change it on the server, and then update the $tsmpassword variable in your puppet configuration.
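For anyone who would rather see the logic than read the manifest, here is a small Python sketch of the same check-then-set flow. It is not how the puppet module is implemented; needs_password and ensure_password_stored are hypothetical names, and the run parameter exists only so the dsmc calls can be swapped out when testing.

```python
import subprocess

AUTH_FAILURE_CODE = "ANS1025E"


def needs_password(dsmc_output: str) -> bool:
    """True if any output line begins with the authentication failure code."""
    return any(
        line.startswith(AUTH_FAILURE_CODE) for line in dsmc_output.splitlines()
    )


def ensure_password_stored(password: str, run=subprocess.run) -> bool:
    """Store the password only when dsmc reports an authentication failure.

    Returns True if 'dsmc set password' was invoked, False otherwise.
    """
    # stdin from /dev/null stops dsmc waiting for interactive input.
    query = run(
        ["dsmc", "query", "session"],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
    )
    if not needs_password(query.stdout):
        return False
    run(["dsmc", "set", "password", password, password], check=True)
    return True
```

The important design point is the same as in the module: the set password step is only triggered by the failure, so re-running it on a healthy node is a no-op.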

tsm::client is available on github.

Nagios Plugin – check_lotus_mem_opinion.pl

Wrote a new nagios plugin the other day. It was bothering me that there are checks out there for lotus/domino memory usage based on bytes used/free, but none that ask poor old domino how it feels about the matter.
At OID 1.3.6.1.4.1.334.72.1.1.9.4.0 it quietly makes its opinion known but no one has really paid it much attention until now..
Returns either ‘Plentiful’ (OK), ‘Normal’ (WARNING) or ‘Painful’ (CRITICAL).
SNMPv3 is not really supported as I threw the thing together in all of about 5 minutes, but v1 and v2 are – we tend not to use SNMPv3 anyway due to lack of support for 64 bit counters – feel free to fork it on github.
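The mapping itself is about as simple as a plugin gets. The plugin is written in Perl; this is just a Python sketch of the decision step, with the SNMP query left out entirely. evaluate_opinion is a hypothetical name, and treating any unexpected string as UNKNOWN is my assumption here rather than a description of what the plugin does.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

OPINION_TO_STATE = {
    "Plentiful": (OK, "OK"),
    "Normal": (WARNING, "WARNING"),
    "Painful": (CRITICAL, "CRITICAL"),
}


def evaluate_opinion(opinion: str):
    """Map Domino's memory opinion to a (exit code, status line) pair."""
    value = opinion.strip()
    # Anything we do not recognise is reported as UNKNOWN (assumption).
    code, label = OPINION_TO_STATE.get(value, (UNKNOWN, "UNKNOWN"))
    return code, f"{label} - Domino memory availability is '{value}'"
```

A wrapper would print the status line and exit with the code, which is all Nagios needs.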

Adding an objectClass to an Existing LDAP Object using Powershell

I’ve recently been spending a bit of time getting to know Powershell a little better, and I have to say I’m quite enamoured with it. While not as flexible or powerful as, say, Perl, it’s definitely a giant leap in the right direction on Microsoft’s part.

For managing contacts and contactgroups in nagios, we use several custom objectClasses and attributes held in LDAP (in this case, Active Directory).
For contacts, we have a custom objectClass called ‘nagiosContact’, which has several mandatory attributes: ‘nagiosContactAlias’, ‘nagiosHostNotificationOptions’, ‘nagiosHostNotificationPeriod’, ‘nagiosHostNotificationCommands’, ‘nagiosServiceNotificationOptions’, ‘nagiosServiceNotificationPeriod’ and ‘nagiosServiceNotificationCommands’. This objectClass needs to be added to an existing user object in order for our contact creation scripts to pull the needed information from LDAP.

The usual way of adding attributes to an object in Powershell is via the ‘Put’ method. This works fine when adding existing attributes to an object, but it won’t work if we want to add an objectClass that includes mandatory attributes: you can’t add the objectClass without having the mandatory attributes present, but you can’t add the mandatory attributes without first having the objectClass applied to the object. A classic Chicken/Egg scenario.

PutEx is a special method that allows for handling an array of attributes. It takes input in the form: update flag, attribute name, array of values to set/unset. Changes are not written until SetInfo() is called. The chicken and the egg are created together and God saw that it was good.

The update flag is an integer from 1 to 4, with the following meanings:

1 - ADS_PROPERTY_CLEAR - Remove all values of the attribute.
2 - ADS_PROPERTY_UPDATE - Replace the current values of the attribute with the ones passed in. This will clear any previously set values.
3 - ADS_PROPERTY_APPEND - Add the values passed into the set of existing values of the attribute.
4 - ADS_PROPERTY_DELETE - Delete the values passed in.
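To make the four flags concrete, here is a rough Python sketch that models them against a plain dictionary standing in for the object's attributes. It does not talk to Active Directory at all; put_ex and AdsProperty are names made up for illustration, and only the numeric values come from the list above.

```python
from enum import IntEnum


class AdsProperty(IntEnum):
    CLEAR = 1
    UPDATE = 2
    APPEND = 3
    DELETE = 4


def put_ex(attributes: dict, flag: int, name: str, values: list) -> None:
    """Apply one PutEx-style update to an in-memory attribute dictionary."""
    current = list(attributes.get(name, []))
    if flag == AdsProperty.CLEAR:
        # Remove every value of the attribute.
        attributes.pop(name, None)
    elif flag == AdsProperty.UPDATE:
        # Replace whatever was there with the new values.
        attributes[name] = list(values)
    elif flag == AdsProperty.APPEND:
        # Add the new values to the existing set, skipping duplicates.
        attributes[name] = current + [v for v in values if v not in current]
    elif flag == AdsProperty.DELETE:
        # Remove only the values passed in.
        remaining = [v for v in current if v not in values]
        if remaining:
            attributes[name] = remaining
        else:
            # Assumption for this model: an attribute with no values is dropped.
            attributes.pop(name, None)
    else:
        raise ValueError(f"unknown update flag: {flag}")
```

For a multi-valued attribute like objectClass, APPEND is the one you want: UPDATE would wipe out the classes the user object already has.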

This example script shows PutEx in operation, adding our custom objectClass and all needed attributes to an existing user object.

# modUserNagios.ps1
# Makes an existing user into a nagios contact

# set modify type
[int] $ADS_PROPERTY_CLEAR  = 1
[int] $ADS_PROPERTY_UPDATE = 2
[int] $ADS_PROPERTY_APPEND = 3
[int] $ADS_PROPERTY_DELETE = 4

$objClass = "nagiosContact"
$nagiosContactAlias = "Phoebe Foo"
$nagiosHostNotificationOptions = "d,u,r,s,f"
$nagiosHostNotificationPeriod = "extended_workhours"
$nagiosServiceNotificationOptions = "w,u,c,r"
$nagiosServiceNotificationPeriod = "extended_workhours"
$nagiosHostNotificationCommands = "host-notify-html-email"
$nagiosServiceNotificationCommands = "service-notify-html-email"
$objUser = [adsi] "LDAP://CN=Phoebe Foo,OU=Infrastructure,OU=Users,OU=SiteName,DC=ad,DC=example,DC=com"

$objUser.PutEx($ADS_PROPERTY_APPEND, "objectClass", @("$objClass"))
$objUser.Put("nagiosContactAlias", "$nagiosContactAlias")
$objUser.Put("nagiosHostNotificationOptions", "$nagiosHostNotificationOptions")
$objUser.Put("nagiosHostNotificationPeriod", "$nagiosHostNotificationPeriod")
$objUser.Put("nagiosHostNotificationCommands", "$nagiosHostNotificationCommands")
$objUser.Put("nagiosServiceNotificationOptions", "$nagiosServiceNotificationOptions")
$objUser.Put("nagiosServiceNotificationPeriod", "$nagiosServiceNotificationPeriod")
$objUser.Put("nagiosServiceNotificationCommands", "$nagiosServiceNotificationCommands")
$objUser.SetInfo()

Vim-Friendly Keybinding in X on Sun Keyboards

Since last November, I’ve been the proud owner of a Happy Hacking Pro 2 keyboard. Expensive, geeky, and almost unusable for anyone that hasn’t already spent a decent amount of time with it; I have come to the conclusion that I am in love with it. Never one for staying faithful to one piece of hardware for too long, I also have a Sun Type 6 keyboard lying around that I go back to for old time’s sake (it was my main keyboard for over two years).

If you’re a half-decent sysadmin/coder/computer user, then the chances are that you spend half your life in vim (and spend the other half wishing that you were back in vim). The escape key is a pretty important key that gets a lot of usage (as it’s the key to exit insert mode and return to command mode, among other things). Yes, I know you can re-map this function to another key or use ctrl-c; but there’s something pretty satisfying about hitting that key – perhaps one of the reasons why PFU sell a big red escape key for the HHK2 for $4.50, and perhaps why I was suckered into actually buying it.

The Sun keyboard has a giant help button on the top-left of the keyboard, right next to the escape key. It seems that its only function is to annoy me by opening the Gnome help dialogue whenever I mis-hit escape. Luckily, X allows you to remap any key, so I re-mapped this seemingly useless key to be another escape key.

To make the change system-wide, edit the file /etc/X11/Xmodmap (create a new one if it doesn’t exist already).

! Bind SunHelp to escape key
keysym Help = Escape

To make the change local to your own account only, place the above configuration in the dotfile: ~/.Xmodmap
Then either restart X, or type xmodmap /etc/X11/Xmodmap to load it immediately.

To get a full list of all available keys: xmodmap -pk
To capture keyboard input using the X Event tester: xev

Online Backups in Lotus Connections 3.0 Within a Clustered Environment

I’ve recently been involved in a fairly high-visibility pilot deployment of Lotus Connections 3.0 for a large corporate.
Usually I don’t get involved in big heavy IBM software wherever I can avoid it, but as my involvement in the project would mainly centre around configuration/tuning of RHEL, Apache IHS, and touching on OpenLDAP TDS, I didn’t jump back straight away. It’s also given me the chance to get to know the terribly flabby and over-complex fellow that is IBM WebSphere a whole lot better.

And so onto backup:

The first statement we received from the IBM Connections ‘Guru’ was that Connections cannot be backed up on-line. All databases and application servers must be stopped. Obviously, in a global deployment which is going to be accessed from many different time zones, nightly downtime for backups isn’t acceptable.
The DB2 consultant was perplexed at this requirement – with archive logging enabled, there is no need to shut down the database to create a backup; online backups are perfectly acceptable and widely used.
So, we pushed back and asked them to explain this logic. The response was that the dependency lies with the Files and Wikis applications – both store files on the filesystem, and the metadata for those files is stored within the DB2 database; so in order to take a solid backup, there must be no opportunity for a user to delete or otherwise modify files within the backup window for the application server’s filesystem and the database.
We were then pointed to some docs for Wikis and Files that showed how to use the jython interpreter via the wsadmin interface to pause the file deletion tasks.

Great, I thought, I know a bit of Python so it should be pretty straightforward to just script these actions to be called as pre and post scripts, so that when the DB2 backup is called it will SSH to the deployment manager and pause these tasks, dump the databases and application file-storage, then start the tasks running again.
So, I set up a proof-of-concept on a couple of VMs, tested it and it all worked swimmingly. Problem solved.
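The shape of that pre/post wrapper is worth spelling out, because the one thing you really don't want is a failed dump leaving the deletion tasks paused forever. A minimal Python sketch, where the three callables are hypothetical stand-ins for the SSH, DB2 and filesystem steps rather than my actual scripts:

```python
def run_backup(pause_tasks, dump, resume_tasks):
    """Pause the file deletion tasks, take the backup, then always resume."""
    pause_tasks()
    try:
        dump()
    finally:
        # Runs whether or not the dump raised, so tasks are never left paused.
        resume_tasks()
```

If pausing itself fails, nothing else runs and the error propagates, which is the behaviour I'd want from a backup job.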

And so onto the deployment in the pre-production environment. Multiple application servers serving multiple application clusters, using shared NFS storage for Files and Wikis data, handled by a single deployment manager. Got it all configured and.. FAIL.

WASX7209I: Connected to process "dmgr" on node pgplnkdwdm0001CellManager01
using SOAP connector; The type of process is: DeploymentManager
WASX7303I: The following options are passed to the scripting environment
and are available as arguments that are stored in the argv variable:
"[/opt/IBM/WebSphere/AppServer/profiles/Dmgr01]"
1:
WebSphere:cell=lpgplnkdwdm0001Cell01,name=ActivityService,type=LotusConnections,node=lpgplnkdwas0001Node01,process=activities
2:
WebSphere:cell=lpgplnkdwdm0001Cell01,name=ActivityService,type=LotusConnections,node=lpgplnkdwas0002Node01,process=activities2
Which service do you want to connect to?

When there is more than one node serving that application (as will generally happen when you have a cluster), the script prompts the user to choose which one to connect to.
All I could find relating to this issue was this one un-resolved question on the IBM Dev forums.

So, I started digging around the Python modules to see where this was coming from; as well as the main lotusConnectionsCommonAdmin module, each application has its own Python administration module. The application administration modules are more of a bastardisation of a module and a script. They contain classes, and are generally called in the same way as a module would be, but also have a requirement when there is more than one node available to it, to interact directly with the user and ask which service to connect to.. from stdin.
This is bad programming 101. A module should be a module, and hence not interact directly with a user – if there is a requirement to interact directly with a user, this should be done by a script. Aside from this fact, I can’t believe that IBM have not thought this through at all – generally speaking, backups tend not to be carried out manually by humans. It’s not even a new issue, as it existed in LC 2.5.

So, the options are: Use expect (which is an arcane and dark art – and also quite a ‘hacky’ solution). Or, modify the ‘module’ to not be so stupid.
I chose the latter.

To avoid any future updates overwriting these modules and breaking functionality, I made a copy of the two modules to $module_unattend.py which would be called by my pre and post backup scripts. With that being said, the modifications are very minor, and safe – if they don’t detect the existence of the ‘serviceNum’ integer, they will prompt the user as normal – and so you can safely overwrite the existing module if you’re feeling brave.
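The idea behind the change looks roughly like this. This is not IBM's code and not my patch (the real modules are Jython and proprietary); it's a plain Python 3 sketch in which choose_service and its parameters are invented for illustration, with service_num playing the role of the ‘serviceNum’ integer.

```python
def choose_service(services, service_num=None, prompt=input):
    """Pick a service by number, prompting only if no number was supplied."""
    if len(services) == 1:
        # Nothing to choose between, so never prompt.
        return services[0]
    if service_num is None:
        # Original behaviour: list the candidates and ask the user.
        for index, name in enumerate(services, start=1):
            print(f"{index}: {name}")
        service_num = int(prompt("Which service do you want to connect to? "))
    if not 1 <= service_num <= len(services):
        raise ValueError(f"service number {service_num} is out of range")
    return services[service_num - 1]
```

An unattended script passes service_num for each node in turn; an interactive user who passes nothing sees exactly the same prompt as before.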

As Lotus Connections code is proprietary, I’ve only uploaded patches for the admin modules, which you can find here.
Also included are pause and resume scripts for each application that query the deployment manager for all servers running those applications, and then cycle through each one, actioning the request.

Hopefully IBM fix this issue in later versions; as much as I generally dislike software that is bloated/heavy, java-based and/or proprietary; with the release of version 3.0 (at least from an end user’s perspective), Connections has gone from “tries hard, could do better” to a product that almost makes you believe for a second that all of the hype surrounding “Collaboration and Social Computing in the Workplace” isn’t just a set of new clothes for the emperor.

Hash Sum Mismatch When Using Aptitude Through a Proxy

Internet access for my DMZ servers is provided by Squid running on my Vyatta router in transparent mode.

This is great for conserving bandwidth when downloading packages, but it has the undesirable effect of occasionally giving the error message: “Hash Sum Mismatch” when updating the package index on the servers behind the proxy – this can be caused by a corrupt or outdated version of the package indexes in cache.

Other than flushing the cache on the proxy, the other workaround for this is to force aptitude to update its index from scratch:

sudo apt-get update -o Acquire::http::No-Cache=True

RHCE Exam: Brutal

RHCE requirements: score of 70 or higher on RHCT components (100 points)
score of 70 or higher on RHCE components (100 points)

RHCT requirement: score of 70 or higher on RHCT components (100 points)

RHCT components score: 92.6%
RHCE components score: 68.8%

RHCT Certification: PASS
RHCE Certification: NO PASS
